[HN Gopher] What is `Box<str>` and how is it different fro...
___________________________________________________________________
What is `Box<str>` and how is it different from `String` in Rust?

Author : asimpletune
Score  : 103 points
Date   : 2022-06-24 09:54 UTC (1 day ago)

(HTM) web link (mahdi.blog)
(TXT) w3m dump (mahdi.blog)

| sirwhinesalot wrote:
| It's unfortunate that strings are badly named in Rust. They got
| that better with Path and PathBuf.
|
| str is fixed size, like a Java String
|
| String is growable, like a Java StringBuilder
|
| After that, we get into memory ownership, with &str not owning
| memory, and Box<str> owning memory, but you rarely need the
| latter, so it's really &str vs String that you need to care
| about.
|
| EDIT: changed immutable to fixed and mutable to growable to
| better reflect the real difference, though typically you almost
| always use immutable &str and &mut String. I thank the commenters
| below for pointing it out, I don't want to make the problem even
| more confusing than it already is.
| Arnavion wrote:
| String used to be StrBuf first. The rename to String was
| intentional because String was the more commonly known name in
| other languages.
|
| https://rust-lang.github.io/rfcs/0060-rename-strbuf.html
| lifthrasiir wrote:
| Note that this is a very old RFC and doesn't have much
| context and discussion compared to later RFCs. It is
| worthwhile to read the actual discussion that happened [1].
|
| [1] https://github.com/rust-lang/rfcs/pull/60
| howinteresting wrote:
| This was a mistake. Having str and StrBuf would have been
| significantly less confusing than str and String.
| steveklabnik wrote:
| I often joke that this is the only change I'd desire for a
| Rust 2.0.
| OJFord wrote:
| What about aliasing it, marking String as deprecated in
| docs, 'please use StrBuf'? (Clippy warning, etc.)
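For readers following the naming debate, here is a minimal sketch of the fixed-size versus growable distinction sirwhinesalot describes, together with the aliasing idea OJFord floats. The alias `StrBuf` and the helpers `shout` and `append_world` are hypothetical names made up for this illustration; nothing called `StrBuf` exists in today's standard library.

```rust
// Hypothetical alias, for illustration only: NOT a standard library name.
type StrBuf = String;

// Takes a borrowed, fixed-size view. Callers can pass a &str directly,
// or a &String / &Box<str>, which deref-coerce to &str.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

// Growing the text requires the owned, growable buffer type.
fn append_world(buf: &mut StrBuf) {
    buf.push_str(", world");
}

fn main() {
    let mut owned: StrBuf = String::from("hello");
    append_world(&mut owned);
    println!("{}", shout(&owned));

    // Once the text is final, it can be frozen into a fixed-size owner.
    let boxed: Box<str> = owned.into_boxed_str();
    println!("{}", shout(&boxed));
}
```

Because a type alias is only another name for the same type, this changes nothing about behaviour; it would only affect how the code reads.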
| steveklabnik wrote:
| In theory you could do something like this, but it would
| be a _lot_ of churn for a questionable amount of gain. I
| probably wouldn't support it today; Rust is past being
| able to make these sorts of changes imho.
| sirwhinesalot wrote:
| Unfortunately, judging by the fact so many people are still
| confused about it, it was a mistake. Having a shorthand for
| something (str) and that thing (String) be different things
| was dumb, and someone brought that up in the discussion at
| the time but I guess hindsight is 20/20.
|
| C++ has std::string and std::string_view which makes loads
| more sense.
|
| Java and C# have StringBuilder and String.
|
| Go has strings.Builder and string.
|
| Objective-C/Cocoa has NSMutableString and NSString.
|
| Ada has Unbounded_String, Bounded_String and Fixed_String for
| different use cases.
|
| Rust has by far the worst naming.
| kzrdude wrote:
| I guess C++ has the best names after all, Rust should have
| emulated those (except it couldn't - string_view came after
| Rust and maybe even was inspired by Rust.)
| cmrdporcupine wrote:
| Chromium's C++ StringPiece dates back to at least 2012,
| and pretty sure Google had something similar (I forget
| that name) to it in Google3's C++ base library (which
| became abseil's string_view) before that even.
|
| I seem to recall Boost may have had a string_view pretty
| far back, too.
|
| https://chromium.googlesource.com/chromium/src/base/+/master...
|
| https://github.com/abseil/abseil-cpp/blob/master/absl/string...
| nicoburns wrote:
| Personally I'd prefer String/StringView (and potentially Path
| and PathView), but I guess that ship has sailed.
| Blikkentrekker wrote:
| I find that this explanation does not do justice
|
| The important part is that `str` is a dynamically sized type as
| it's called. What it is is simply a region of memory, of any
| size, containing UTF8.
| Since it is dynamically sized, various
| constraints are placed onto it which in practice come down to
| that it can only really be passed around at runtime by being
| behind a pointer and is hard to directly put on the stack.
|
| `String` is three words, two words are equivalent to a "fat
| pointer" to a `str`, as in one word for the address, and the
| other for the size, which is how Rust deals with dynamically
| sized types in general, and the third word denotes the capacity
| of memory allocated to the `String` which it uses to know when
| to reallocate.
|
| `str` is neither mutable nor immutable which isn't part of its
| type, `&str` is immutable, and `&mut str` is mutable. It's
| perfectly possible in Rust to mutate a `str` if one obtains a
| mutable, or perhaps better called exclusive reference to it
| somehow, but the mutations that can be performed are very
| limited since the size cannot easily grow.
|
| This is where `String` comes in, which guarantees that the
| space after the `str` pointed to it, the size of its
| "capacity" third word is not used by anything else, and thus it
| can grow more easily by manipulations.
|
| There are some limited mutation methods on `&mut str` in Rust,
| such as `make_ascii_uppercase`, which converts all lowercase
| ascii letters to uppercase, which is perfectly fine, since this
| operation is guaranteed to not ever increase the size of the
| `str`, but with unicode such a guarantee no longer applies and
| one needs a `String`.
|
| That being said, yes, I would have favored for `String` to be
| called `StrBuf`, and `Vec` `SliceBuf` instead.
| sirwhinesalot wrote:
| Sure, if you want to be truly specific about it and not do a
| Java analogy ;)
| aliceryhl wrote:
| The difference has to do with ownership, and it has nothing to
| do with mutability. For _both_ types, you can mutate them given
| a mutable reference, and you can't given an immutable reference.
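The mutability point made in the last two comments can be sketched in a few lines. This is only an illustration, and the helper names `uppercase_in_place` and `exclaim` are made up for it: an exclusive reference to a `str` allows in-place edits that keep the byte length unchanged, while anything that grows the text needs a `String`.

```rust
// Uppercases ASCII letters in place. The byte length never changes,
// so an exclusive reference to the fixed-size `str` is enough.
fn uppercase_in_place(s: &mut str) {
    s.make_ascii_uppercase();
}

// Appending may need more room, so this requires the growable `String`,
// which tracks spare capacity and reallocates when necessary.
fn exclaim(s: &mut String) {
    s.push('!');
}

fn main() {
    let mut owned = String::from("hello");
    uppercase_in_place(&mut owned); // &mut String coerces to &mut str
    exclaim(&mut owned);
    println!("{}", owned);

    let mut boxed: Box<str> = "fixed".into();
    uppercase_in_place(&mut boxed); // &mut Box<str> coerces to &mut str
    // exclaim(&mut boxed);         // would not compile: Box<str> cannot grow
    println!("{}", boxed);
}
```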
|
| For an example, an `&mut str` can be modified via various
| methods such as make_ascii_uppercase.
| sirwhinesalot wrote:
| Nope, not ownership either, Box<str> and String both own
| their memory, the difference is fixed size vs growable :)
|
| But you're right, I edited my post to reflect this, the Java
| analogy is pretty strained as it is.
| Macha wrote:
| I believe the parent poster was comparing &str and String,
| not Box<str> and String.
| marcosdumay wrote:
| > but you rarely need the latter
|
| AFAIK, it's because people go with String when what they
| actually mean is Box<str>. Since they have similar costs,
| nobody ever sees the need to change it, and the String type
| does have a much better name.
|
| But the need is there all the time. People just satisfy it
| differently.
| sirwhinesalot wrote:
| I think it's mainly because unlike Java, where a
| StringBuilder is effectively an optimisation over
| concatenating Strings, in Rust managing that memory would be
| a total pain, so you tend to keep the mutable thing around.
|
| Once that happens, Box<str> becomes kinda unnecessary. There
| are many cases where it would be the correct type, for
| example reading from a file in a read-only manner, but most
| of the time you're going to be doing _something_ to that
| text, so it makes more sense to just load it up as a String
| already and avoid the unnecessary copy.
|
| Either way, it's mostly a naming problem. &str/String sucks
| :(
| fpoling wrote:
| String in Rust is very similar to std::string in C++, while str
| is std::string_view except it is safe to use.
|
| StringBuffer in Java is not like String in Rust. In particular,
| one cannot pass StringBuffer in Java to a function taking
| String, while both Rust and C++ allow to implicitly convert the
| string backed by a heap into the corresponding read-only view.
| sirwhinesalot wrote:
| Strings in Java own their memory, they aren't views, they're
| closer to Box<str>.
| That's why you can't implicitly convert a
| StringBuilder into one.
|
| I know this, I'm not the one you need to explain it to, it's
| Rust newbies. So many problems would have been avoided with
| Str/StrBuf or StrView/Str, but now the ship has sailed.
| rrobukef wrote:
| Strings in Java share their memory with other substrings of
| the same allocation. They are views.
| cesarb wrote:
| IIRC, that used to be the case, but recent Java releases
| changed it so that memory is no longer shared with
| substrings. The former behavior could cause some extreme
| memory leaks (unless you were very careful to always
| manually duplicate each substring); a one-character
| substring could keep a multi-megabyte memory allocation
| alive. See for instance
| https://stackoverflow.com/questions/33893655/string-substrin...
| which discusses this issue.
| OJFord wrote:
| If OP is here, then in this listing:
|
|     let boxed_str: Box<str> = "hello".into();
|     println!("size of boxed_str on stack: {}", std::mem::size_of_val(&boxed_str));
|
|     let s = String::from("hello!");
|     println!("size of string on stack: {}", std::mem::size_of_val(&s));
|
| I know it's not the point and doesn't make a difference, but you
| might want to make the two 'strings' the same (not with & without
| '!'), just to be clearer.
| umanwizard wrote:
| This might clarify the situation, for C or C++ folks:
|
|     // heap-allocated, fixed-size
|     struct BoxStr {
|         unsigned length;
|         // INVARIANT: this points to a heap allocation of length bytes, and is valid utf8
|         unsigned char *data;
|     };
|
|     // heap-allocated, resizable
|     struct String {
|         unsigned length;
|         unsigned capacity;
|         // INVARIANT: heap allocation of capacity bytes, the first length of which are valid utf8
|         unsigned char *data;
|     };
|
| Of course you _could_ resize BoxStr, but only by reallocating
| `data` to the exact desired length every time, which will kill
| your asymptotic complexity.
| tylerhou wrote:
| Is your first example really equivalent to Box<str>?
| I would have expected something like
|
|     using BoxStr = std::unique_ptr<Str>;
|
| where Str is defined as
|
|     struct Str {
|         size_t len;
|         char data[];
|     };
|
| The difference is that the len is stored on the heap, and the
| data is stored inline with the length. Unfortunately C++ does
| not support flexible array members so this syntax is not
| actually valid.
|
| Edit: Never mind, after reading the article Rust does use the
| above representation because Box holds a "fat" pointer to str,
| which stores its length on the stack. So BoxStr is the correct
| equivalent, because &[u8] is not equivalent to u8*, it's
| equivalent to std::span<u8>.
| steveklabnik wrote:
| Your parent is correct, the length is stored alongside the
| pointer, not on the heap with its data. This is true for any
| "dynamically sized type," not just Box<str>. &str is also a
| (pointer, length) pair, for example.
| the__alchemist wrote:
| I'm working on a PC-based configuration for a drone flight
| controller. PC-side is std Rust with a stack available. Firmware
| is `no-std`, running on a microcontroller. It has waypoints you
| can program when connected to a PC using USB. They have names
| that need to be represented as some sort of string.
|
| I'm using `u8` arrays for the strings on both sides; seems the
| easiest to serialize, and Rust has `str::from_utf8` etc to handle
| conversion to/from the UI.
|
| `String` is unsupported on the MCU side since there's no
| allocation. I find this low-level approach ergonomic given it's
| easy to [de]serialize over USB.
| sampo wrote:
| Title is: What is Box<str> and how is it different from String in
| Rust?
| dang wrote:
| Fixed now. Thanks!
| codedokode wrote:
| Is there official documentation about what `str` (without an
| ampersand) is? For example, documentation [1] says that `str` is
| a "string slice" (without explaining what "string slice" means),
| and then goes on with description of &str.
|
| And a book on Rust [2] says:
|
| > A string slice is a reference to part of a String
|
| This seems wrong, because &str can reference static strings which
| are not String. And if str, or "string slice" is a "reference",
| then &str is a reference to a reference?
|
| And later:
|
| > The type that signifies "string slice" is written as &str
|
| But the documentation said that "string slice" is str, not &str.
|
| Also, I wonder, what do square brackets mean when they are used
| without an ampersand (as s[0..2] instead of &s[0..2])?
|
| Also, is an ampersand in &str the same as an ampersand in &u8
| (meaning an immutable reference to u8) or does it have other
| meaning?
|
| [1] https://doc.rust-lang.org/std/primitive.str.html
|
| [2] https://doc.rust-lang.org/book/ch04-03-slices.html#string-sl...
| [deleted]
| LegionMammal978 wrote:
| > Is there official documentation about what `str` (without an
| ampersand) is? For example, documentation [1] says that `str`
| is a "string slice" (without explaining what "string slice"
| mean), and then goes on with description of &str.
|
| A `str` is really just a `[u8]` with extra semantics. Thus, a
| `&str` is really a `&[u8]`, a `&mut str` is a `&mut [u8]`, a
| `Box<str>` is a `Box<[u8]>`, etc. So we call it a "string
| slice", since it mostly acts like a regular `[T]` slice.
|
| In general, the term "slice" can either refer to the unsized
| type `[T]` or the reference `&[T]`/`&mut [T]` interchangeably.
| You could also call the latter a "slice reference" where the
| distinction is important; e.g., a `Box<[T]>` would be a "boxed
| slice", while `Box<&[T]>` would be a "boxed slice reference" or
| "boxed reference to a slice". But most of the time, the correct
| meaning can be inferred from context.
|
| > Also, I wonder, what do square brackets mean when they are
| used without an ampersand (as s[0..2] instead of &s[0..2])?
|
| `s[0..2]` is a place expression that refers to the raw `str`
| subslice.
| But since `str` is an unsized type [0], it cannot
| appear on its own; it must appear behind some reference type.
| Thus, `&s[0..2]` creates a `&str`, and `&mut s[0..2]` creates a
| `&mut str`. However, the ampersand isn't always necessary: you
| can write `s[0..2].to_owned()` to use the `str` as a method
| receiver, which implicitly creates a reference.
|
| [0] https://doc.rust-lang.org/book/ch19-04-advanced-types.html#d...
| ruuda wrote:
| The & in &str is like the & in &[u8], str is like [u8] (an
| unsized type), not like u8. A &str is a "fat pointer" (pointer
| + length), unlike &u8 which is a regular "thin" pointer.
| FullyFunctional wrote:
| This is missing a conversation about
| https://lib.rs/crates/compact_str (and a few alternatives like
| it). TL;DR: String takes the space of three pointers, that is, 24
| bytes on 64-bit archs. compact_str fits up to 24 byte strings in
| the same space and reverts to String for longer strings.
|
| ADD: that is, avoids heap allocation for those, unlike both
| Box<str> and String.
| tialaramex wrote:
| Box<str> is still going to be smaller _if_ you know how big the
| text is because (unlike CompactString and String) it doesn't
| need to carry a capacity value. In exchange of course you can't
| append things to it (without re-allocating)
|
| CompactString is a very clever+ SSO implementation, and I'll
| remember it is there if I run into a situation where it might
| help but I firmly agree with Rust's choice _not_ to implement
| the SSO optimisation in the standard library's String type.
|
| + Storing 23 UTF-8 codepoints as one of several representations
| in a 24 byte data structure makes sense, you can see how to
| write a fairly safe SSO optimisation for Rust which does that,
| but the CompactString scheme relies on the fact Rust's strings
| are by definition UTF-8 encoded to squeeze the discriminant
| into the same space as the last possible byte of an actual
| UTF-8 string, so it can store a 24 byte value like
| "ABCDEFGHIJKLMNOPQRSTUVWX" inline despite also distinguishing
| the case where it needs a heap pointer for larger strings.
| That's very clever.
| rtfeldman wrote:
| > I firmly agree with Rust's choice not to implement the SSO
| optimisation in the standard library's String type.
|
| Out of curiosity, why is that?
|
| I don't know much about how or why that decision was made,
| but I'm curious.
| lifthrasiir wrote:
| SSO means that pretty much every string operation has multiple
| code paths, which can be highly unpredictable. Basically it
| is a trade-off between memory usage and performance, and
| the standard library is not really a good place to make
| that trade-off. By comparison many C++ codes (still) copy
| strings all over the place for no good reason, so SSO in
| the standard library has a much greater appeal.
| pornel wrote:
| A nice thing is that all string types have &str as the lowest
| common denominator, so even if you use SSO or on-stack or any
| other fancy string type, it's automatically compatible with
| almost everything.
| terhechte wrote:
| I recently gave a Rust workshop to Kotlin and Swift developers.
| Strings in Rust are a really, really difficult topic for complete
| newcomers because they're understood as a basic type whereas in
| Rust they require having read half the Rust book to grasp.
|
| Consider: I can teach a lot of Rust basics with `usize`. Defining
| functions, calling functions, enums because they're `Copy` and
| because there's only one type.
| String requires knowing about &str
| which requires knowing about deref which requires knowing about
| (&String -> &str), it also requires understanding lifetimes,
| moving, heap and stack, cloning. Then, if you want to work with
| the file system you also need to understand Paths, OsString and
| AsRef.
|
| With Kotlin and Swift, for all these things, you really just need
| one type, String, and you handle it just like usize.
|
| It is really a bit of a hurdle for new developers coming from
| higher level languages (especially if they just give it a quick
| try).
| klabb3 wrote:
| Don't worry. As soon as you explain to them that appending to a
| PathBuf is O(1) amortized they'll come around, and it will
| scale much better for all their GB-sized file paths.
|
| I guess this adds a prerequisite on complexity theory but
| nobody should go anywhere near advanced data structures like
| strings with less than a bachelor in CS.
| lijogdfljk wrote:
| Makes me wonder if there could be room for a SimpleString
| library.
|
| I love/use Rust. I don't think any of this is complicated. BUT,
| i'm a big fan of just "clone your problems away" for beginner
| Rust users. Going knee deep into techniques which merely reduce
| memory usage when people likely don't actually care - at all -
| about it just feels wrong to me.
|
| So yea, maybe a cursed library where SimpleString is just some
| niceties around some Cow + Arc thing which is also Copy. Hell,
| you could probably just apply it to Vec and who knows what else.
|
| Anyway, clearly not something i'm advocating anyone _really_
| use. But it seems a nice way to make stuff "Just Work" in the
| beginning.
| kzrdude wrote:
| Some weird construction around Cow + Arc that is also Copy is
| not really possible in Rust, I'm sorry to report. No way to
| implement it and even if you could (you technically "can" by
| reimplementing most of Cow and Arc) - the result is not
| useful, the destructor of it doesn't work.
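As a rough illustration of the "clone your problems away" idea and of kzrdude's objection, `Arc<str>` is one existing building block. Cloning it bumps a reference count instead of copying the text, but it is `Clone` rather than `Copy`, because code has to run on every clone and drop. The helper `share` below is a made-up name for this sketch.

```rust
use std::sync::Arc;

// Hands out another owner of the same text. This increments a reference
// count; the string bytes themselves are not copied.
fn share(s: &Arc<str>) -> Arc<str> {
    Arc::clone(s)
}

fn main() {
    let a: Arc<str> = Arc::from("waypoint");
    let b = share(&a);

    // Both handles point at the same heap allocation.
    println!("same allocation: {}", Arc::ptr_eq(&a, &b));
    println!("owners: {}", Arc::strong_count(&a));

    // let c = a; let d = a; // would not compile: Arc<str> is not Copy
    println!("{} {}", a, b);
}
```

The explicit `clone` call is the price of the reference counting; a type that is `Copy` cannot also have a destructor, which is the obstacle kzrdude points out.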
| codedokode wrote:
| But Rust is designed to write high-performance code. If you
| don't care about performance, you don't really need Rust.
| Swift or Go seem more readable and easier to use.
| pjmlp wrote:
| Swift is pretty much about performance, as replacement for
| C, C++ and Objective-C in the Apple ecosystem, it is even
| on Apple's official sites.
|
| What Apple isn't willing to do is sacrifice productivity
| while achieving that goal.
| howinteresting wrote:
| Swift is well-designed but is virtually non-existent
| outside of Apple platforms, so it doesn't have nearly the
| third-party ecosystem that Rust does. Go has the
| third-party ecosystem but is poorly designed and doesn't have
| basic language features like sum types.
|
| Rust is likely the best combination of thought-out design
| and ecosystem support that exists in a programming language
| today.
| pjmlp wrote:
| Rust is also pretty much focused on Linux workloads,
| mostly.
|
| Also the Apple ecosystem has plenty of third parties,
| including commercial libraries.
| jeroenhd wrote:
| Interestingly, Microsoft is also pushing Rust quite hard
| with special API packages, tutorials, and even some IDE
| integration. Windows tools are often closed source,
| though, so you'll probably never notice it if your
| favourite tool uses Rust or not.
| agumonkey wrote:
| rust has one uphill battle in the mainstream adoption is that a
| lot of things make sense if you wrote bare metal code. If not
| then it can be very confusing.
| tialaramex wrote:
| I think I'd recommend teaching Move semantics not Copy
| semantics from the outset, because Move semantics work fine
| everywhere in Rust and the Copy semantics are just an
| optimisation. As you've found, if you teach Copy then for types
| which aren't Copy you now need to teach Move.
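A minimal sketch of that move-first view, for readers who want to see it in code. The helper `consume` is a made-up name; the commented-out line shows what the compiler rejects.

```rust
// Takes ownership of its argument: the caller's variable is moved in.
fn consume(s: String) -> usize {
    s.len()
}

fn main() {
    let a = String::from("hi");
    let b = a; // move: `a` can no longer be used
    // println!("{}", a); // would fail to compile: use of moved value

    let c = b.clone(); // explicit duplicate, works for any Clone type
    let n = consume(b); // `b` is moved into the function
    println!("{} {}", c, n);

    let x: usize = 5;
    let y = x; // usize is Copy: `x` stays usable after the assignment
    let z = x.clone(); // also compiles, though it is not idiomatic
    println!("{} {} {}", x, y, z);
}
```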
|
| Languages like Kotlin and Swift are doing a _lot_ of lifting to
| deliver this behaviour for String, and of course they can't
| keep it up, so students who've done more than a little Kotlin
| or Swift will be aware of the idea of "reference semantics" in
| those languages where most of the objects they use do not have
| the behaviour they've seen in String which is instead
| pretending to be a value type like an integer.
|
| Again, if you only teach Move, you're fine. After not very long
| a student will wonder how they can duplicate things (since they
| didn't know Copy), and you can show them Clone. Clone works
| everywhere. Is cloning a usize idiomatic Rust? No it is not.
| Does it work just fine anyway? Of course it does! And of course
| Clone is implemented for String, and for most types beginners
| will ever see.
| hgomersall wrote:
| Are copy semantics always used in place of move semantics for
| a Copy type? I didn't know that.
| [deleted]
| tialaramex wrote:
| Literally all that Copy does is it says after assignment
| the moved-from variable can still be used. So in this
| sense, sure, these semantics are "always used". But if you
| don't use the variable after assigning from it, you could
| also say the semantics aren't used in this case. Does that
| help? Copy does a _lot_ less than many people think it
| does.
|
| If you're a low level person it's apparent this is because
| Copy types are just some bits and their meaning is
| literally in those bits, _Copy_ the bits and you've copied
| the meaning. Thus, this "it still works after assignment"
| Copy behaviour is just how things would work naturally for
| such types. But Rust doesn't require programmers (and
| especially beginners) to grok that.
|
| It's possible to explain Copy semantics first in a way
| that's easier to grasp for people coming from, say, Java,
| but that's only half the picture because your students will
| soon need Move semantics which are different.
| Thus I
| recommend instead explaining Move semantics from the outset
| (which will be harder) and only introducing Copy as an
| optimisation.
|
| I think this might even be better for students coming from
| C++, because C++ move semantics are a horrible mess, so
| underscoring that Move is the default in Rust and it's fine
| to think of every assignment as Move in Rust will avoid
| them getting the idea that there must be secret magic
| somewhere, there isn't, C++ hacked these semantics in to a
| finished language which didn't previously have Move and
| that's why it's a mess.
|
| I'm less sure for people coming from low-level C. I can
| imagine if you're going to work with no_std on bare metal
| you might actually do just fine working almost entirely
| with Copy types and you probably need actual bona fide
| pointers (not just references) and so you end up needing to
| know what's "really" going on anyway. If you're no_std you
| don't have a String type anyway, nor do you have Box, and
| thus you can't write Box<str> either, although &str still
| works fine if you've burned some strings into your firmware
| or whatever.
| afdbcreid wrote:
| This isn't really something you usually encounter, but I
| have to bring this cute example:
|
|     pub fn foo() -> impl FnOnce() {
|         let non_copy: String = String::new();
|         let copy: i32 = 123;
|         || {
|             drop(non_copy); // Works
|             drop(copy); // error[E0373]
|         }
|     }
|
| https://play.rust-lang.org/?version=stable&mode=debug&editio...
| lumost wrote:
| Rust strings are difficult for others coming from statically
| typed and low level languages as well.
|
| It's one of the types programmers will most often encounter,
| and yet it's one of the most obtuse topics within rust.
| k__ wrote:
| I remember strings being "not so easy" in C/C++ too.
| oconnor663 wrote:
| I think the big differences are that copying and reference
| taking are automatic and invisible in C++.
| So a lot of APIs
| taking string or string& will "just work" for the
| beginners, and you can delay the part where you talk about
| how different those things are.
|
| This sounds like a minor difference, but I've met lots of
| developers who do meaningful work in C++ but who don't know
| what a copy constructor is. I get the impression that
| there's an enormous difference between being a C++ "user"
| vs a "library writer", because there's so much automatic
| stuff happening under the covers.
|
| Rust tends to have a bit less invisible complexity, I
| think, but some of that difference is just making the
| complexity visible (like reference taking), which
| effectively frontloads it onto beginners. It's a tough
| tradeoff.
| jokethrowaway wrote:
| After haskell strings, rust strings actually felt reasonable
| nicoburns wrote:
| On the plus side, String makes a really good example to explain
| ownership, moving, stack vs heap, etc. All of which you need at
| least a basic understanding of to do anything non-trivial in
| Rust.
|
| I kind of feel like it goes without saying that Rust isn't
| ideal for beginners. For developers who already have a good
| knowledge of other languages I feel like learning about these
| things shouldn't be a problem, as becoming familiar with these
| concepts is one of the main benefits of learning Rust.
| smaddox wrote:
| > I kind of feel like it goes without saying that Rust isn't
| ideal for beginners.
|
| I think that depends on, first, what the goal is, and second,
| what you're comparing to. I think Rust is easier on
| beginners, in many ways, than C. And C is easier on
| beginners, in many ways, than assembly or machine code. But
| if you want to really understand computer programming,
| starting at machine code or at least assembly isn't a crazy
| way to start.
| tialaramex wrote:
| Beginning with machine code for some simple architecture
| (maybe RISC-V these days?) might be one good route in.
|
| I can also see (having experienced it myself, albeit I
| already knew C etc. these were not requirements and many of
| my classmates did not) beginning with a pure functional
| language where all the practicalities are abstracted
| entirely.
|
| Today the University where I learned this begins with Java,
| which I am confident is the wrong choice, but the person
| who part-designed their curriculum, and is a friend,
| disagrees with me and he's the one getting paid to teach
| them.
| msla wrote:
| > But if you want to really understand computer
| programming, starting at machine code or at least assembly
| isn't a crazy way to start.
|
| I've long suspected that the CS field was founded on two
| approaches: The people who started from EE and worked their
| way up, and the people who started from Math and worked
| their way down. The former people think assembly is the
| "real" way to approach software, and probably view C++ as
| "very high-level", whereas the latter people think everyone
| should start with a course on the lambda calculus and type
| systems and gradually ease into Haskell, work down to Lisp,
| and then maybe deign to learn Python for * _shudder_ *
| numerical work.
| nicoburns wrote:
| I'd argue there's also a 3rd foundation of CS: language.
| Programming languages really are languages in the general
| sense of the word, and their purpose is to allow humans
| to effectively communicate with machines. Focussing on
| optimising that communication is the 3rd approach.
| nicoburns wrote:
| > I think Rust is easier on beginners, in many ways, than
| C. And C is easier on beginners, in many ways, than
| assembly or machine code. But if you want to really
| understand computer programming, starting at machine code
| or at least assembly isn't a crazy way to start.
|
| I mean sure. But equally, starting with Python isn't a
| crazy way to start. And Python is a much easier language to
| learn than any of those (esp.
| if you want to actually
| create something practical with it).
| hgomersall wrote:
| Sure, but if your objective is systems programming,
| you'll probably quickly get to the point of realising
| python is not the right choice.
| pjmlp wrote:
| Depends, if writing a compiler is still considered
| systems programming in modern times.
|
| https://www.amazon.com/Writing-Interpreters-Compilers-Raspbe...
| less_less wrote:
| Compilers are their own beast -- I wouldn't put them with
| systems code. They're pretty different from an OS, BLAS,
| machine learning kernel, game engine, network stack,
| database or what have you. There's not as much buffer
| management, speed and memory aren't usually as critical,
| you don't make direct syscalls, many structures are
| graphs rather than arrays, etc. They often aren't even
| multithreaded.
|
| It's also popular to write compilers in distinctly
| non-"systems-y" languages, most notably Standard ML but
| also eg Haskell, and lots of languages are self-hosted.
| nicoburns wrote:
| If your objective is specifically systems programming
| then you'll quickly outgrow python, but I'm not convinced
| that makes it the wrong starting point. For systems
| programming you'll likely need _both_ high-level and
| low-level programming concepts. Learning low-level first is
| absolutely a valid path, but my point is that going
| high-level first is equally valid. People on the internet like
| to make out like someone who starts out by learning
| Python is incapable of later learning low-level
| concepts, but if anything they're at an advantage
| compared with someone with no programming experience at
| all.
| nvrspyx wrote:
| This is just my opinion, but I can't imagine systems
| programming being the objective of any beginner. A
| beginner probably wouldn't even be able to differentiate
| systems programming from applications programming.
| jez wrote:
| Do any of the string types in the Rust standard library implement
| the same sort of small string optimization that C++ libraries
| implement for std::string? (explained here[1])
|
| Some quick searching turned up a few rust-lang internals posts
| and GitHub issues, but it was hard to see whether anything came
| of them.
|
| I understand that it's probably possible to implement a
| comparable String API in a crate that uses small string
| optimizations, but being able to avoid a dedicated crate makes
| interoperability with other libraries much easier.
|
| [1] https://tc-imba.github.io/posts/cpp-sso/
| aaaaaaaaaaab wrote:
| https://github.com/rust-lang/rust/issues/20198
| edflsafoiewq wrote:
| Not in std, no.
| steveklabnik wrote:
| Rust's standard library strings cannot because of a specific
| API, as_mut_vec, which is incompatible with the internal
| representation necessary to do SSO.
| 24bytes wrote:
| https://github.com/ParkMyCar/compact_str
|
| https://old.reddit.com/r/rust/comments/t33hxp/announcing_com...
| dochtman wrote:
| The tl;dr doesn't quite make sense to me. To me the core
| difference is that a Box<str> takes one less word on the stack,
| because by virtue of the str being immutable it doesn't need to
| track the capacity of the allocation as distinct from the length.
| This is analogous to Box<[u8]> vs Vec<u8> (and in fact those are
| the same data types except for the guarantee of valid UTF-8).
| tialaramex wrote:
| One notable difference is that ToOwned for &str gives you a
| String, whereas ToOwned for &[u8] gives you a [u8] by cloning
| the slice you have.
|
| In fact all four standard library types that are ToOwned
| without invoking Clone are more or less strings (str, CStr,
| OsStr, Path)
| tines wrote:
| C++ programmer here: which one guarantees valid utf8, and why
| would a primitive container make guarantees about the values
| it's storing?
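dochtman's word-count point a few comments up can be checked directly. The sketch below is only an illustration: the helper `freeze` is a made-up name, and the three-word size of `String` is what current standard library implementations produce rather than a documented layout guarantee.

```rust
use std::mem::size_of;

// Converts a String into an exact-fit Box<str>, giving up any spare
// capacity along with the ability to grow in place.
fn freeze(s: String) -> Box<str> {
    s.into_boxed_str()
}

fn main() {
    let word = size_of::<usize>();
    // &str and Box<str> are (pointer, length); String adds a capacity.
    println!("&str:     {} words", size_of::<&str>() / word);
    println!("Box<str>: {} words", size_of::<Box<str>>() / word);
    println!("String:   {} words", size_of::<String>() / word);

    let mut s = String::with_capacity(64);
    s.push_str("hello");
    println!("capacity before freezing: {}", s.capacity());

    let boxed = freeze(s);
    println!("{} ({} bytes)", boxed, boxed.len());
}
```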
| lifthrasiir wrote:
| Everything labelled as "string" is a valid UTF-8 string in
| Rust, and to my knowledge this decision was made very early
| in the history of Rust (before 0.1). Many "modern" languages
| (including modern enough C++) have a distinction between
| Unicode strings and byte strings however they are called and
| Rust just followed suit.
| Animats wrote:
| "str" and "String" guarantee UTF-8. To make a String from an
| array of bytes, call
|
|     pub fn from_utf8(vec: Vec<u8, Global>) -> Result<String, FromUtf8Error>
|
| which consumes the input Vec and returns it unmodified, if
| it's valid UTF-8, or reports an error, if it's not. There
| are a number of related functions in this family. Such as
|
|     pub fn from_utf8_lossy(v: &[u8]) -> Cow<'_, str>
|
| which takes in a slice of bytes and checks if it's a UTF-8
| string. If it is, it returns the original str. Otherwise it
| makes a copy with any errors replaced with the Unicode error
| character.
|
| Vec<u8> and array slices such as &[u8] are primitive
| containers - they can store any sequence of u8 values. String
| is more like an object with access methods.
| pornel wrote:
| The guarantee exists to speed up UTF-8 processing, so that it
| can safely assume working with whole codepoints/sequences
| (without extra out of bounds checks for every byte) and to
| ensure you can always losslessly roundtrip every string to
| and from other Unicode encodings without introducing any
| special notion of a broken character. There's also a security
| angle in this: text-processing algorithms may have different
| strategies for recovering from broken UTF-8, which could be
| exploited to fool parsers (e.g. if a 4-byte UTF-8 sequence
| has only 3 bytes matching, do you advance by 3 or 4 bytes?).
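A short sketch of the two conversion functions Animats mentions, to show where the UTF-8 check happens. The wrapper names `decode_strict` and `decode_lossy` are made up for this illustration; the standard library calls inside them are `String::from_utf8` and `String::from_utf8_lossy`.

```rust
use std::borrow::Cow;

// Strict conversion: takes ownership of the bytes and succeeds only
// when they are valid UTF-8. No copy is made on success.
fn decode_strict(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

// Lenient conversion: borrows the input when it is already valid,
// otherwise allocates a copy with invalid sequences replaced by U+FFFD.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    let euro = vec![0xE2, 0x82, 0xAC]; // UTF-8 encoding of U+20AC
    println!("{:?}", decode_strict(euro));

    let broken = vec![b'a', 0xFF, b'b']; // 0xFF never appears in UTF-8
    println!("{:?}", decode_strict(broken.clone()));
    println!("{:?}", decode_lossy(&broken));
}
```

Either way the validation happens once, at the boundary where bytes become a string type.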
|
| Having the "valid UTF-8" state being part of the type system
| means it needs to be checked only once when the instance is
| created (which can be compile-time for constants), and
| doesn't have to be re-checked later, even if the string is
| mutated. Unlike a generic bag of bytes, the public interface
| on string won't allow making it invalid UTF-8.
| ntoskrnl wrote:
| > why would a primitive container make guarantees about the
| values it's storing
|
| If you know you have valid UTF-8, you can safely skip bounds
| checks when decoding a codepoint that spans multiple bytes.
___________________________________________________________________
(page generated 2022-06-25 23:00 UTC)