[HN Gopher] Designing a better strcpy
___________________________________________________________________

Designing a better strcpy

Author : signa11
Score  : 61 points
Date   : 2021-06-17 09:50 UTC (2 days ago)

(HTM) web link (saagarjha.com)
(TXT) w3m dump (saagarjha.com)

| Tempest1981 wrote:
| Something feels wrong about using memccpy to copy strings... ever
| since I saw bugs where people used memcpy incorrectly.
|
| And is there a wchar_t version of memccpy?

| WalterBright wrote:
| An even better solution:
|
| https://www.digitalmars.com/articles/C-biggest-mistake.html
|
| With the lengths of strings known, copying becomes fast, safe,
| and trivial.

| Hackbraten wrote:
| Not everyone is going to be happy about 8 bytes overhead for
| every buffer.
|
| Also, not every C-style API will migrate. If you consume one of
| those, you may still have to count the length, which costs
| cycles.

| scottlamb wrote:
| > Not everyone is going to be happy about 8 bytes overhead
| for every buffer.
|
| Then prefix it with four bytes (which, if you omit the
| trailing NUL, adds three bytes in total). If your C strings
| are longer than 2^32 bytes, you're probably doing something
| wrong.
|
| I also should point out that memory is typically cheap. Cache
| is more precious, and I'd expect on balance adding a length
| would pollute the cache less than scanning through the whole
| string unnecessarily.
|
| But I agree with your second point:
|
| > Also, not every C-style API will migrate. If you consume
| one of those, you may still have to count the length, which
| costs cycles.
|
| Similarly, the article started with this text:
|
| > Like them or not, null-terminated strings are essential to
| C, and working with them is necessary in all but the most
| trivial programs.
|
| You can do things nicely in your own code, but you still need
| NUL-terminated strings when dealing with existing code. And
| dealing with existing code is often the reason to pick C...
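
For readers who want the length-carrying idea from the comments above in concrete form, here is a minimal sketch in C. The `lenstr` type and both helper names are hypothetical, made up for this illustration; they come from neither the article nor the linked proposal. It assumes the 4-byte length suggested in the thread. The point it tries to show: once the length is known, a copy is one bounds check plus memcpy, while a string arriving from an existing C API still costs one strlen scan at the boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pointer-plus-length string (illustration only). */
typedef struct {
    const char *ptr; /* the bytes; need not be NUL-terminated */
    uint32_t len;    /* 4-byte length, so strings are capped below 2^32 bytes */
} lenstr;

/* Wrap a NUL-terminated string coming from existing C code.
   This is the one place where the bytes have to be scanned. */
lenstr lenstr_from_cstr(const char *s) {
    assert(s != NULL);
    size_t n = strlen(s);
    assert(n <= UINT32_MAX);
    lenstr r = { s, (uint32_t)n };
    return r;
}

/* Copy src into dst, which has room for cap bytes.
   Returns the number of bytes copied, or (size_t)-1 if src does not fit.
   No NUL is written: the length travels with the data instead. */
size_t lenstr_copy(char *dst, size_t cap, lenstr src) {
    assert(dst != NULL);
    if (src.len > cap)
        return (size_t)-1;
    memcpy(dst, src.ptr, src.len);
    return src.len;
}
```

Handing the result back to a NUL-terminated API would still need one extra byte for the terminator, which is the interop cost raised in the comments above.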

| WalterBright wrote:
| The proposal does not take away any C features.

| loup-vaillant wrote:
| Actually, it's 7 bytes (you can remove `\0` at the end), and
| that's if you insist on being able to exceed 4GB with your
| strings.
|
| Personally, depending on the use case, I'd be willing to have
| strings limited to 256 bytes (no overhead), 64K bytes (1 byte
| of overhead), and 4G bytes (3 bytes of overhead). Though
| combining them all might be quite the nightmare...

| codesnik wrote:
| You can also have something akin to MIDI encoding, with the
| high bit reserved to mark that the next byte is part of the
| length field, too. Up to 128 bytes with no overhead, up to
| 16 KB with one byte of overhead, and so on. Such a detail would
| be easy to hide in the library. But I wouldn't be surprised
| if alignment and other troubles ate any possible gain.

| WalterBright wrote:
| Won't work with UTF-8.

| tedunangst wrote:
| The concern with performance seems a little overwrought. strlcpy
| is only slow in the bad case where it truncates, which is ideally
| not the common case. I've never heard of or seen a performance
| bottleneck traced to a strlcpy in the hot path.
|
| If you really cared about performance, you'd be using nothing but
| memcpy with careful length tracking. Regardless of algorithmic
| runtime, any function that examines bytes as it copies will be
| slower than a length based copy.

| dataflow wrote:
| I hate to nitpick on something so mundane and superficial, but
| why in the world are people still writing code like this in 2020?
|
|     while (--len && (*dst++ = *src++))
|         ;
|
| Dereferenced post-increments are already confusing enough as-is,
| and yet here we have 1 pre-decrement and 2 _dereferenced_
| _post_-increments happening on _top_ of an assignment in a
| _conditional_, all in a single expression. Even as someone who
| _does_ put an assignment in a conditional once in a while, this
| still feels 100% unjustifiable to me.
| It's especially ironic given the premise is that C code has
| security bugs... if the goal is to avoid that, shouldn't there be
| even _more_ care taken to avoid this kind of code?

| abhishekjha wrote:
| I guess this is so that something like this can be asked in
| interviews to test how lucky the candidate is on that
| particular day and hour. Otherwise everybody would pass the
| interviews.

| stefan_ wrote:
| I don't want to use any of the str*cpy functions, all of them are
| either braindead or missing in most libcs. At this point I'm all
| in on snprintf(%s, foo).

| saagarjha wrote:
| I hate to do this, but: read the post. I mention snprintf
| specifically:
| https://saagarjha.com/blog/2020/04/12/designing-a-better-str....
| It is neither efficient nor general-purpose as a string copying
| routine.

| [deleted]

| okareaman wrote:
| I expected to read about SIMD and copying 8 bytes at a time in 64
| bit registers, but I guess that is a compiler optimization

| rbanffy wrote:
| Would it make sense to extend the C standard to have a sensible
| string type?
|
| /me ducks

| pjmlp wrote:
| It would, just like a sensible array type and proper
| enumerations as well.
|
| Obviously WG14 doesn't care about it.

| zabzonk wrote:
| Referring to strcpy:
|
| > we can only use it if we know our destination buffer is smaller
| than our source buffer
|
| Should be the other way around, surely?

| Randor wrote:
| Absolutely,
|
| That's not the only discrepancy.
|
| https://pubs.opengroup.org/onlinepubs/9699919799/functions/m...
|
| Note the following: "The memccpy() function does not check for
| the overflow of the receiving memory area." "If copying takes
| place between objects that overlap, the behavior is undefined."
|
| The strxcpy he provides at the bottom doesn't look better at
| all. I'm not sure where the author got that function. I found
| some better variants of the proposed strxcpy function with
| bounds checks and overflow detection.
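
To make the memccpy caveats quoted above concrete, here is a rough sketch in C of a bounded string copy built on memccpy. The name `bounded_copy` is hypothetical and this is not the article's strxcpy. It assumes a POSIX system (memccpy only entered ISO C with C23), it assumes the caller passes the true size of dst, and the choice to NUL-terminate the destination on truncation is an assumption of this sketch. As the quoted specification says, nothing here can detect a wrong `len` or overlapping buffers.

```c
#define _XOPEN_SOURCE 700 /* memccpy is POSIX, not ISO C before C23 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only).
   Copies the NUL-terminated string src into dst, writing at most len bytes.
   Returns a pointer just past the copied NUL when src fits,
   or NULL when src had to be truncated (or len is 0).
   dst and src must not overlap, and len must not exceed dst's real size. */
char *bounded_copy(char *restrict dst, const char *restrict src, size_t len) {
    assert(dst != NULL && src != NULL);
    if (len == 0)
        return NULL; /* no room even for the terminator */
    /* memccpy stops after copying the first '\0', or after len bytes */
    char *end = memccpy(dst, src, '\0', len);
    if (end == NULL)
        dst[len - 1] = '\0'; /* truncated: terminate (a choice made for this sketch) */
    return end;
}
```

With an 8-byte buffer, copying "hi" returns the buffer address plus 3; copying "hello" into a 3-byte buffer returns NULL and leaves "he" behind.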

| saagarjha wrote:
| I wrote strxcpy as an example of a function that satisfies
| the requirements that I posted at the top. In doing so, it
| becomes a useful routine for other kinds of needs, such as
| this one: https://news.ycombinator.com/item?id=27564004
|
| The reason I omitted the specific verbiage you quoted is that
| it applies equally to all the functions. My function has an
| important bounds check, but (like all of C) trusts that the
| parameters you provide to it are correct. There is no
| additional checking done because doing so in C is not
| tenable. If implemented in a standard library it may be
| useful to add additional heuristics to detect invalid cases,
| but they are fundamentally best-effort and not worth showing
| in an example.

| saagarjha wrote:
| LOL yes, thanks for catching that. I'll fix it when I'm at my
| computer :)
|
| Edit: edited. I also had a message from past me that I had
| forgotten about:
|
|     <!-- I made a mistake somewhere, didn't I? -->

| saurik wrote:
| > In the case where src fits in dst, it will return a pointer
| past the NUL byte it placed; otherwise it returns NULL to
| indicate a truncation.
|
| It is amazing to me how personal these preferences are ;P...
| like, I'd be much happier with an API that always returns the
| location of the NUL byte on success; and, if the string gets
| truncated, then it instead returns dst+len (the address of the
| byte past the end of the buffer). This allows for chained
| constructions that provide efficient strcat-style semantics with
| easy error propagation, such as this example which concatenates
| three strings (which I honestly hope I got right... I'm giving
| reasonable odds to Saagar telling me I've coded a buffer overflow
| by accident somewhere ;P):
|
|     char buf[X]; // for any X, even 0!
|     char *cur = buf;
|     char *end = buf + sizeof(buf);
|     cur = strjcpy(cur, str1, end - cur);
|     cur = strjcpy(cur, str2, end - cur);
|     cur = strjcpy(cur, str3, end - cur);
|     if (cur == end) goto fail;

| kolbusa wrote:
| Did you mean to check for (cur > end)?

| saurik wrote:
| No: if the function succeeds, the address of the NUL byte
| will always be !=end (as it must be inside of the buffer,
| which end is not); whereas, in the case of string truncation,
| cur will be equal to end, as the function returns dst+len
| (which is end); and, if that error had "already" happened
| during a previous call, it will get propagated through the
| next call as end-cur will be 0, causing the next call to
| immediately fail (even for a 0-length string, which is an
| important corner-case) and return cur+0 (which is still end).

| saagarjha wrote:
| Not seeing anything wrong, but it's C so who knows ;)
| Thankfully, the primitive I (and memccpy) provide makes writing
| your wrapper easy and efficient, as opposed to all the other
| functions which don't compose at all. (From my phone) I think
| this might work?
|
|     char *strjcpy(char *dst, const char *src, size_t len) {
|         char *result = strxcpy(dst, src, len);
|         return result ? --result : dst + len;
|     }

| saurik wrote:
| FWIW, part of the fun of strjcpy is that it is also much
| easier to write than strxcpy (and watch as I go further and
| further out on a limb with sketchy C that is likely wrong,
| lol... I _did_ test it, at least! ;P):
|
|     char *strjcpy(char *restrict dst, const char *restrict src, size_t len) {
|         for (;; ++dst, --len)
|             if (!len || !(*dst = *src++))
|                 return dst;
|     }
|
| (edit) Oh, I have an even cuter implementation (which might
| look "too clever" but actually demonstrates something
| important about the function)!
| Essentially, what makes
| strjcpy so "pure" is that the NULL return is really a
| "special case" in strxcpy that you're having to "undo" in
| that wrapper, whereas the semantics of strjcpy--which may
| _sound_ a bit weird--are mapping directly to what naturally
| terminates the loop: running out of space on one of the two
| inputs. This purity then gets taken advantage of by the
| caller to get such easy call chaining and error propagation,
| as one of these loops can "continue through" into the next
| loop without any adaptation logic.
|
|     char *strjcpy(char *restrict dst, const char *restrict src, size_t len) {
|         for (; len && (*dst = *src++); ++dst, --len);
|         return dst;
|     }
|
| (edit) Ok: one difference is that this implementation of
| strjcpy (this isn't intrinsic to the return value surface I
| described: just to the gloriously simple versions in this
| comment; your adapted version, for example, is "fine")
| doesn't put a NUL byte in case of truncation... though, I
| personally am not at all sold on doing that: I want to firmly
| fail the operation, rather than try to "use" the truncated
| data :/. Adding that special case would still result in
| strjcpy being simpler than strxcpy (and doesn't break its
| semantics advantages: just make sure to return the address of
| dst+len, not that extra NUL), but it isn't quite so amazingly
| simpler at that point ;P.

| [deleted]

| vasama wrote:
| Migrating to a pointer-size-pair string representation would be a
| better use of one's time.

| Gibbon1 wrote:
| I swear part of the problem is with C there is a cargo cult
| prohibition against passing small structs by value.

| kzrdude wrote:
| In other languages we create types for just about anything,
| and like you say, it's strange that we don't do this more in
| C.

| Gibbon1 wrote:
| Big problem with C is the standards committee just flat out
| refuses to add an array type to C.
| It's deranged because if
| you had first class arrays it'd be a lot easier to generate
| code that takes advantage of SIMD instructions.

| klodolph wrote:
| You mean "add a second array type?"
|
| C has an array type, it's just a bit wacky, and it
| doesn't play by the same rules as other types.

| b5n wrote:
| It's not a cult, it's just that the cases where the risks of
| passing by value would be worth any perceived advantage are
| so few that it just doesn't make sense to even consider it.
|
| It's not like it's a flimsy tribal based claim, the guidance
| is solid.

| saurik wrote:
| It does suck that the usual 64-bit calling convention limits
| the size of struct passed by value to 64-bits :/.

| thysultan wrote:
| You could store two 32 bit fields and cast the second to a
| pointer when needed, and treat the first as a length.

| pjmlp wrote:
| The cargo cult goes beyond that.
|
| The belief that it was created alongside UNIX from the start,
| when it was used to port UNIX V4 into a high-level language.
|
| Micro-optimizing each line of code as it is written, "because
| it is fast", without even bothering to use a profiler.
|
| Even though lint was created alongside C to fix already known
| programmer faults using the language, in 1979, the belief
| that only bad programmers need that kind of tooling.

| PaulDavisThe1st wrote:
| > Micro-optimizing each line of code as it is written,
| "because it is fast", without even bothering to use a
| profiler.
|
| With the CPU, MMU and OS architectures of that period, it
| wasn't particularly hard to infer what was fast without
| profiling it. The slow rise in complexity at all 3 levels
| now makes it hard for even extremely experienced
| close-to-the-metal programmers to understand what will be fast
| or slow without a profiler. Times do change, in fact.

| radicalcentrist wrote:
| Are there any recommended libraries for doing this, if I'd like
| to migrate my C codebase?

| macintux wrote:
| You might look at SDS.
|
| https://github.com/antirez/sds

| [deleted]

| compiler-guy wrote:
| And only works if you don't rely on any third-party libraries
| that take normal C strings.
|
| Which is to say, is unrealistic for many programs.

| pjmlp wrote:
| Depends on how much they value security, even std::string has
| an extra null for _c_str()_ calls.

| tgv wrote:
| You can use a library that also adds a trailing \0. I used it
| to interface with C++.

| [deleted]
___________________________________________________________________
(page generated 2021-06-19 23:01 UTC)