[HN Gopher] Sqids - Generate Short Unique IDs from Numbers ___________________________________________________________________ Sqids - Generate Short Unique IDs from Numbers Author : vyrotek Score : 285 points Date : 2023-11-25 17:30 UTC (5 hours ago) (HTM) web link (sqids.org) (TXT) w3m dump (sqids.org) | dfc wrote: | It's weird under "Get Started" they have links to 40 different | languages. You can only get started with 15 of the 40 languages | listed, the other 25 are skeleton repos asking for people to | start the repo to indicate interest. | hooverd wrote: | Maybe a slam dunk first FOSS contribution? | ctoth wrote: | This seems like a perfect use case for an LLM :) | LeFever wrote: | It's kinda clever. The people most likely to look at this | project are also likely ideal candidates for implementing the | library in a new language (Developer, FOSS enthusiast, | interested in the project, need the library in a language | they're familiar with that isn't implemented yet). | | Also, the language pills differentiate between those that have | been implemented (color logo, dark text, bold) and those that | aren't (grayscale). | 4kimov wrote: | Good points. Those pages also contain links to old | implementations (Hashids), because a lot of projects still | use those and want to be able to find them. | alas44 wrote: | Also can help track which languages people click on, probably | a good proxy of where there would be the need to develop a | lib | vyrotek wrote: | The approach definitely works. Some time ago I saw .NET listed | but discovered it wasn't complete. I was eager to replace an | existing Hashids implementation so I made some comments, shared | a starter-snippet, and then someone was excited enough to | complete in just a few days. It was great to see how quick the | community stepped in. Maybe there was a bit of Cunningham's Law | in effect with my contribution, ha. | | https://github.com/sqids/sqids-dotnet/issues/2#issuecomment-... 
| c2xlZXB5Cg1 wrote: | Reminds me of proquints https://github.com/dsw/proquint | | But 127.0.0.1 looks more "readable" to me than lusab-babad | whalesalad wrote: | This used to have a totally different name iirc, they used to be | called hashids | resoluteteeth wrote: | Yeah, it says that both in the page title and the logo at the | upper left | no_wizard wrote: | I like the idea, though I use nanoid with the safe letter | dictionary (it excludes letters used for profanity[0]) | | They should use a similar dictionary approach IMO because I | looked at the implementation and it's hardcoded to look for "bad" | words | | Otherwise looks real straightforward! I'd love to see some | performance test suites for it | | [0]: https://github.com/sqids/sqids-javascript/blob/ebca95e114932... | | [1]: though with UUID v4 so common to generate and well optimized | in most languages I wonder if these userland solutions are really | better. You can always generate a UUID and re-encode with base32 | or base64, which is also well optimized in most languages | lxgr wrote: | > it excludes letters used for profanity | | That doesn't seem possible. How would that work? | | > I looked at the implementation and it's hardcoded to look for | "bad" words. | | If you mean https://github.com/y-gagar1n/nanoid-good, that | seems to be doing the same thing. | | In general, I'm a bit wary of solutions that "guarantee no bad | words" - this is usually highly language-specific: One | language's perfectly acceptable name is another language's | swear word. | no_wizard wrote: | This is the implementation: | https://github.com/CyberAP/nanoid-dictionary | | We use it in a highly internationalized product spanning | multiple languages and haven't yet run into a complaint or | value on audit that would constitute something offensive in any | language per our intl content teams anyway.
| | That isn't to say it's 100% (and simply enough we don't audit | every single URL) but I suspect we would have gotten at least | a user heads up by now | | Never the less we are moving our approach to uuids that get | base32 encoded for some of our use case for this. They're | easier to work for us in many scenarios | Sharlin wrote: | Omit vowels and you're 90% of the way there; omit the vowel- | looking digits 0,1,3,4 and you're probably >99% of the way | there. | gberger wrote: | fxck | Sharlin wrote: | Which is, evidently, why nanoids also excludes x and X, | as well as v and V (fvck). | Silasdev wrote: | It's particularly funny because their example docs for .NET | outputs "B4aajs", which to any Swedish l33t speaking | individual, would read "Bajs", which means "shit" | livrem wrote: | Looks like the dictionaries used are from this file? | | https://registry.npmjs.org/naughty-words/-/naughty- | words-1.2... | | From a quick look, the lists are pretty short, except for the | one with English words that at least have some 404 words, but | I can imagine there are far more bad words that you want to | avoid than just those? | ape4 wrote: | Here's the C++ of the sqid blocked words | https://github.com/sqids/sqids- | cpp/blob/main/include/sqids/b... | njharman wrote: | > That doesn't seem possible. How would that work? | | agree; b00b, DlCK, cntfcker | | But I suppose, if user doesn't get to craft input, the | collision space of converted numerical ids and words like | above is sufficiently small to be ignorable. | Sharlin wrote: | Besides vowels, nanoid excludes 0, 1, 3, 4, 5, I, l, x, X, | v, V, and other lookalikes, so the chances of generating | something naughty in _any_ language are close to zero. 
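A minimal sketch of the alphabet-pruning idea described in the comments above, assuming the exclusions listed there (vowels, the digits 0, 1, 3, 4, 5, and the letters l, x, X, v, V). The alphabet built here is for illustration only and is not the actual nanoid-dictionary character set; `random_id` is a hypothetical helper name.

```python
import secrets
import string

# Characters to drop, following the exclusions mentioned in the thread:
# vowels, vowel-looking digits, and a few lookalike/problem letters.
# Illustrative only; this is NOT the real nanoid-dictionary alphabet.
EXCLUDED = set("aeiouAEIOU" "01345" "lxXvV")

SAFE_ALPHABET = "".join(
    c for c in string.ascii_letters + string.digits if c not in EXCLUDED
)


def random_id(length: int = 10) -> str:
    """Return a random ID drawn only from SAFE_ALPHABET."""
    return "".join(secrets.choice(SAFE_ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(len(SAFE_ALPHABET), SAFE_ALPHABET)
    print(random_id())
```

Note that IDs produced this way are random, so they have to be stored; unlike a Sqids ID they do not decode back to a number. As several commenters point out, pruning the alphabet lowers the odds of an offensive string but cannot guarantee none in every language.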
| tttp wrote: | I tried something similar with a fixed alphabet that guarantees | no profanity and a checksum (luhn) | | https://github.com/tttp/dxid | dumbo-octopus wrote: | Odd design decision in that if you provide your own blocklist, it | overwrites their (extensive) default list instead of adding to | it. | | And in general the algorithm is surprisingly complicated for | something that could be replaced with simply base64 encoding, the | given example (1,2,3) base64 encodes to a string with just one | more letter than this algorithm. | | That said I do appreciate the semicolon-free-style. I don't | typically see that in libs besides my own. | | https://github.com/sqids/sqids-javascript/blob/main/src/sqid... | 8organicbits wrote: | The problem is their block list will change over time. If you | don't override it, then your IDs won't decode right when you | update. This is a huge risk. | | > You have to account for scenarios where a new word might be | introduced to the default blocklist | | https://sqids.org/faq#future-blocklist | | Honestly, I think they need to rethink this. Otherwise you've | got different library versions for different languages each | using different default blocklists, none of which are | compatible. | jsf01 wrote: | What's the use case for passing in an array of numbers? Typically | when generating an ID my input is either a single random number, | a string that's being hashed, or nothing at all. | 4kimov wrote: | [shard_number, primary_id_number, timestamp] | dumbo-octopus wrote: | But then why not just arbitrary text? | James_K wrote: | I guess they haven't heard of base-64. | xjia wrote: | Or base58, e.g. | https://api.rubyonrails.org/classes/SecureRandom.html#method... | dymk wrote: | That doesn't solve the same set of problems as TFA. Randomized | output order for sequential input, skips IDs that include | profanity. | majkinetor wrote: | One of the points is also to use custom alphabet. 
| canU4 wrote: | Sad that it is not for user ids | 8organicbits wrote: | I think that's only if you don't want to leak user count when | your ID is an autoincrement. Elsewhere people mention | cryptographicly remapping integers, which could work (by | itself, or before passing the ID to sqids). | packetlost wrote: | The name (but not function) seems really close to squuids from | Datomic/Clojure. | 3cats-in-a-coat wrote: | I don't get it, that's like two lines of code, why does it have a | library and even a domain | k2xl wrote: | Also wondering this | its-summertime wrote: | For a similar thing, (X bytes to X bytes, no collisions) | https://en.wikipedia.org/wiki/Format-preserving_encryption is a | good page | jchook wrote: | Also see Knuth Hash and k-dimensional equidistribution. | habitue wrote: | Skipping profanity seems like a liability in this design. It | means in order to preserve the encoding you need to make the | banned word list immutable, otherwise old sqids will decode to | the wrong thing when you get them back. | Etheryte wrote: | I don't think this holds, you can enforce filtering in the | encoding step, i.e. be strict about what you output, but always | decode, even if the input is profanity. This means you can also | be backwards compatible if you update the list etc. So in | short, the old maxim of be strict about your outputs and | lenient about your inputs. | fimdomeio wrote: | From their FAQ: "The best way to ensure your IDs stay | consistent throughout future updates is to provide a custom | blocklist, even if it is identical to the current default | blocklist." | Etheryte wrote: | In that case it sounds like a shortcoming on their part. | There is no fundamental reason to have that limitation. I | understand it can make the implementation easier to not | have it, but in my opinion being blocklist change agnostic | would be a much better value offering. | lights0123 wrote: | The *encoding* changes. 
The decoding stays consistent: | | > Decoding IDs will usually produce some kind of numeric | output, but that doesn't necessarily mean that the ID is | canonical. To check that the ID is valid, you can re-encode | decoded numbers and check that the ID matches. | | The reason this is not done automatically is that if the | default blocklist changes in the future, we don't want to | automatically invalidate the ID that has been generated in | the past and might now be matching a new blocklist word. | runlevel1 wrote: | The stupid simple way I did this ages ago was: | | 1. Start with a-z. | | 2. Drop all vowels, numbers, most homoglyphs, and the letter | 'x'. | | 3. Map digits 0-9 to one of the remaining letters. | | 4. Stringify the integer and replace the digit in each decimal | place with its corresponding character. | | For my use-case, all the numbers were >7 digits long, so the | odds of you getting an offensive acronym were reasonably low | unless you started combining them. | | But there's no perfect solution. As this dataset shows, you can | find offense in almost anything if you look hard enough: | | California Personalized License Plate Requests Flagged for | Review 2015-2016: | https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN... | arp242 wrote: | Many of those reviewer comments are utterly moronic. And that | is my _polite_ opinion. | | How does this work? Is there a review board? Is it put to | public review? A few of them like "dick out" and "shtlord" | are reasonable, but many of them seem so bonkers it looks | like the work of trolls. | | Anyway, TIL that 1970s Intel was a MS-13 gang outfit and that | Octocat really means "eight vaginas". | air7 wrote: | > California Personalized License Plate Requests Flagged for | Review 2015-2016: https://docs.google.com/spreadsheets/d/18IU | VU9Q4uN_lxqNd5AsN... 
| | Wow this is a funny peek into a weird predicament where people | need to justify that they have a good reason to have a | specific license plate. | | Some seem obviously ok such as: | | INT13H | | 314 PI | | And some are obviously not: | | DRY(hand emoji)JOB | | DICK OUT | | Come to think of it: Can license plates have emojis now?! | 8organicbits wrote: | Agreed, this is a big risk made worse by the fact that the default word | list can change over time. | | https://sqids.org/faq#future-blocklist | kaetemi wrote: | It's a base62 encoder that takes multiple integers as input. | Probably a bit-length prefixed encoding. I am assuming it just | pads an extra junk integer to re-roll the encoded number. | 8organicbits wrote: | The mention of one-time passcodes seems odd. Those need to be | unguessable, but don't need to be unique. If you supply a | suitable random source, then I suppose it works, but the "padded | with junk" feature makes these look more complex than they really | are. | | The standard choice of 4 to 8 random digits works well and it's | clear what level of security they provide. Digits are easier to | understand than case-sensitive Latin characters, especially when | your native language uses a different character set. | progne wrote: | In a Ruby app we just convert to a high base, like | > 1234567890.to_s(36) => "kf12oi" | | That gets us most of the way there, but Sqids has a Ruby library | and lets you set a much higher base, including upper case | characters, and I suppose, emoji. We're going to need much bigger | numbers before that space savings makes much difference. I like | it, but it's hard to know when something like that is worth | adding a dependency. | vyrotek wrote: | I believe a big part of the idea is for the hash to be | unpredictable as well. | | If I figure out you're using (36) then I know the next number | 1234567891 is "kf12oj". | | Not the case with Sqids. | hot_gril wrote: | You can easily brute-force this.
Sqids also says it's not | good for sensitive data. | 8organicbits wrote: | It looks like an easy brute force too, there's no compute- | hard operations here. I guess you could scramble your | alphabet? Otherwise Uk always comes after bM, etc. | echelon wrote: | I'd prefer to use crockford-encoded entropy with Stripe-style | token prefixes to create unique ID namespaces. Run in through | a bad words filter, and it's perfect. | | user_1hrpt0xpax7ps | | file_xpax7psaz0tv6az0tv6 | | Etc. | | In distributed systems you can use the trailing bytes to | encode things like author cluster, in case you're active- | active and need to route subsequent writes before create | event replication. | | Easy to copy, debug, run ops/incall against. If you have an | API, they're user-friendly. | | Of course you still want to instruct people the prefixes are | opaque. | wombatpm wrote: | Yeah don't forget the bad words filter. I worked on an IKEA | mailing where the list processing house was adding an | autogenerated discount code to the address label. The | customers received codes with BOOB, DICK, TWAT, and CUNT | embedded within. People were not happy. | otteromkram wrote: | Did they never make an IKEA purchase after that or did | they get over it like a normal adult? | | I don't work retail, but something tells me people will | make a stink out of just about anything if it meant | potentially free products or other compensation. | | Plus, are you filtering just English curse words or all | curse words for countries that use Latin characters? | pelagicAustral wrote: | Correct me if I'm wrong, but, It cannot be unpredictable, | which makes the library redundant for security concerns, | which would be the one business case to seek for anything | other than an UUID (which is already built into Ruby). 
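To make the predictability point from the base-36 subthread concrete, here is a small sketch in Python (the thread's own example uses Ruby's `to_s(36)`). The names `to_base`, `from_base`, and the seeded shuffle are illustrative and not part of Sqids; the shuffle only hides the ordering and, as noted above, is obfuscation rather than security.

```python
import random

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base(n: int, alphabet: str = BASE36) -> str:
    """Encode a non-negative integer using the given alphabet as digits."""
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(alphabet)
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
        if n == 0:
            break
    return "".join(reversed(digits))


def from_base(s: str, alphabet: str = BASE36) -> int:
    """Decode a string produced by to_base with the same alphabet."""
    base = len(alphabet)
    n = 0
    for ch in s:
        n = n * base + alphabet.index(ch)
    return n


if __name__ == "__main__":
    # Sequential inputs give visibly sequential outputs.
    print(to_base(1234567890))  # kf12oi
    print(to_base(1234567891))  # kf12oj

    # Shuffling the alphabet with a fixed seed (a stand-in for a secret)
    # changes the digits but keeps the mapping reversible.
    rng = random.Random(42)
    shuffled = "".join(rng.sample(BASE36, len(BASE36)))
    encoded = to_base(1234567890, shuffled)
    print(encoded, from_base(encoded, shuffled))
```

Even with a shuffled alphabet, consecutive numbers still differ only in their last digit, which is why a plain base conversion leaks ordering in a way the thread says Sqids tries to avoid.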
| paulddraper wrote: | No, squids are predictable | candiddevmike wrote: | BaseEmoji is a thing: https://github.com/amoallim15/base-emoji | exxos wrote: | I didn't think of that, but this is a nice trick! | 1-6 wrote: | Sqids vs Squids. Missing the 'U' for unique but nevertheless a | unique shortened version of the regular spelling. | urza wrote: | I wanted to say that I use similar project called HashIDs, but I | see that HashIDs rebranded to Sqids :) | ComputerGuru wrote: | I haven't been able to find a case for this because ids either | need to be unique or they're not going to be large. If they're | unique, I'm using uuid or ulid (uuidv7 of tomorrow) as the | sortable primary key type to avoid conflicts without using the db | to generate and maintain sequences. | | Where do you have unique ids that aren't the primary key? I would | be more interested in a retrospectively unique truncated encoding | for extant ulid/uuid; ie given that we've passed timestamp foo, | we know that (where no external data is merged) we only need a | bucketed time granularity of x for the random component of the id | to remain unique (for when sortability is no longer needed). | | Or just more generally a way to convert a ulid/uuidv7 to a | shorter sequence if we are using it for external hash table | lookups only and can do without the timestamp component. | Bytewave81 wrote: | The idea is that you encode and decode database IDs with this. | You wouldn't save them separately unless you were using it for | a purpose other than shareable "identifiers" which don't leak | significant amounts of database state. Imagine something like a | link shortener where you want to provide a short link to users, | but don't want it to just be a number. | swyx wrote: | why is ulid the uuidv7 of tomorrow? 
| waffle_ss wrote: | I wrote a Ruby gem to address this problem of hiding sequential | primary keys that uses a Feistel network to effectively shuffle | int64 IDs: https://github.com/abevoelker/gfc64 | | So instead of /customers/1 | /customers/2 | | You'll get something like | /customers/4552956331295818987 | /customers/3833777695217202560 | | Kinda similar idea to this library but you're encoding from an | integer to another integer (i.e. it's format-preserving | encryption). I like keeping the IDs as integers without having to | reach for e.g. UUIDs | chupapimunyenyo wrote: | Hashids seems way better than their new implementation | Use wrote: | Why should you hide your user count? | sneak wrote: | The rate of change over time can be used against you; many | people consider their businesses' month-over-month growth (or | lack thereof) to be private information. | | "$WEBSITE did 50,000 signups a month during the beginning of | the pandemic, but now struggles to sign up a thousand a week" | is a story. | Schnitz wrote: | It would be great to have a quick primer on why this is better | than what people typically homebrew, like base62 encoding a | random number. | sneak wrote: | Database PKs usually aren't random, which AFAIK is what is | usually used as the number in this case. | 8organicbits wrote: | If you use a random number then you need to store it somewhere | to map back to the original. Sqids is an encoding, you can | decode the sqid back to the original without storage overhead. | | Features like the profanity filter avoid creating URL routes | like /user/cuntFh. | | Cross language support allows interop between the encoder and | decoder across microservices written in different languages. | parhamn wrote: | Side note: there are some business insights you can get from a | company using serial ids. | | i.e if you sign up and get user id 32588 and make another account | a few days later, you can tell the growth rate of the company. 
| | And this is possible with every resource type in the application. | | I do wonder how much the url bar junk thing matters these days. I | tend to use ULIDs (waiting on uuid v7 wide adoption), and | they're a bit ugly, but most browsers hide most of the urls now | anyway. The fact that there is a builtin time component comes in | clutch sometimes (e.g. object merging rules). | pacificmint wrote: | > you can tell the growth rate of the company. | | You can even do this when you don't know the exact interval by | using probabilities. The Allies used this method to estimate | German tank production in World War II by analyzing the serial | numbers of captured or destroyed tanks. | | This is known as the German Tank Problem [1] | | [1] https://en.wikipedia.org/wiki/German_tank_problem | lhamil64 wrote: | It also makes it slightly easier to perform certain attacks | since it's trivial to figure out other IDs. | paulddraper wrote: | > most browsers | | Not Chrome... | | Also, links are a thing in chat, etc | parhamn wrote: | Here's what a recent YouTube (which Sqids documents as a | sample use case) link I shared looked like: | | > https://www.youtube.com/watch?v=fFMzQ3tYTFU&pp=ygURY2hQImVz | Z... | | Or Twitter: | | > https://x.com/elonmusk/status/172853302828286055507?s=20 | | Or TikTok: | | > https://www.tiktok.com/@<userId>/video/73029257425923205478 | 5... | | While I tend to strip the tracking params and there are | extensions that do this, I don't think most people do. These | URLs are pretty 'ugly'. | | So if the links that are being shared most on the internet | (YT, TikTok, Twitter) don't care, you probably shouldn't | either. I think the onus is on the UI layers (Chat apps, etc) | to show urls how they look best on their respective | platforms. | | Edit: to this point, it looks like HN truncates these to make | them less ugly too. | swyx wrote: | saving it to my list of uid implementations | https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...
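For the serial-ID point raised earlier in this subthread, a minimal sketch of the standard frequentist estimator for the German tank problem, N ≈ m + m/k - 1, where m is the largest ID observed and k is the number of distinct IDs observed. It assumes IDs start at 1, are handed out sequentially, and the observed sample is drawn uniformly without replacement; `estimate_total` is an illustrative name.

```python
def estimate_total(observed_ids: list[int]) -> float:
    """Estimate how many sequential IDs exist from a sample of observed IDs.

    Uses the minimum-variance unbiased estimator m + m/k - 1, where
    m is the sample maximum and k is the number of distinct observations.
    """
    if not observed_ids:
        raise ValueError("need at least one observed ID")
    distinct = set(observed_ids)
    k = len(distinct)
    m = max(distinct)
    return m + m / k - 1


if __name__ == "__main__":
    # Four observed serial numbers, the largest being 60.
    print(estimate_total([19, 40, 42, 60]))  # 74.0
```

Repeating the estimate a few days apart gives a rough growth rate, which is the leak the commenters describe and one reason to avoid exposing raw auto-increment IDs.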
| bufferoverflow wrote: | Do we really need a library for that? Shouldn't it be a simple | function? | dustingetz wrote: | anyone have a copy pasta for the widest possible alphabet (i.e. | extended unicode safe chars) | filleokus wrote: | > Not Good For: | | > User IDs - Can be decoded, revealing user count | | Suppose you don't want to leak the count, what's a reasonable way | of implementing that? | | You can of course have a uuid v7 / ULIDs or something as the | primary key. Or have it as a public facing primary key, mapping | back to a sequential ID PK (there might be some performance hits | with larger PKs in e.g. Postgres? or is that just fud?) | | But you could also generate a public ID with something like | encrypt(seq_id, secret) and then encode it with whatever alphabet | and/or profanity filter you'd like - right? The issue then is | that all public IDs would be long (and of course dealing with a | decrypt operation on all incoming requests). | | Don't know what's best really. | 8n4vidtmkvmk wrote: | Add an offset, multiply by a large prime number, and modulo. I | don't think you can recover the original number without | figuring out the prime. | kryptogeist wrote: | Damn, those squids are getting smart | orf wrote: | How do you adjust or evolve the blocklist with this, without | making previously generated IDs incorrect? | | The ID is simply incremented if it is blacklisted [1]. So the ID | is fixed to the blacklist content, and adjusting it in any way | invalidates certain segments of previously generated IDs? | | 1. https://github.com/sqids/sqids-rust/blob/9f987886bc06875d782... | revenga99 wrote: | Is there any way to generate short unique IDs from UUIDs? | Snowflake is incredibly slow when joining UUID => UUID columns. | andrewstuart wrote: | What is the decimal range of these values? | sandstrom wrote: | Neat library! | | We're using randomly generated strings for many things. IDs, | password recovery tokens, etc.
We've generated millions of them | in our system, for various use-cases. Hundreds of thousands of | people see them every day. | | I've never heard any complaints about a random content-id being | "lR8vDick4r" (dick) or whatever. | | But nowadays our society is so afraid of offending anyone, that | profanity filters have extended all the way to database IDs and | password recovery tokens. | | (there are some legit cases, like randomly generated IDs for user | profiles shared in public URLs, that users have to live with, but | even there just make the min length 8 and you're unlikely to have | any full-word profanity as the complete ID; put differently, I | don't understand why they made the block list an opt-out thing) ___________________________________________________________________ (page generated 2023-11-25 23:00 UTC)