[HN Gopher] Sqids - Generate Short Unique IDs from Numbers
       ___________________________________________________________________
        
       Sqids - Generate Short Unique IDs from Numbers
        
       Author : vyrotek
       Score  : 285 points
       Date   : 2023-11-25 17:30 UTC (5 hours ago)
        
 (HTM) web link (sqids.org)
 (TXT) w3m dump (sqids.org)
        
       | dfc wrote:
       | It's weird under "Get Started" they have links to 40 different
       | languages. You can only get started with 15 of the 40 languages
       | listed, the other 25 are skeleton repos asking for people to
       | star the repo to indicate interest.
        
         | hooverd wrote:
         | Maybe a slam dunk first FOSS contribution?
        
           | ctoth wrote:
           | This seems like a perfect use case for an LLM :)
        
         | LeFever wrote:
         | It's kinda clever. The people most likely to look at this
         | project are also likely ideal candidates for implementing the
         | library in a new language (Developer, FOSS enthusiast,
         | interested in the project, need the library in a language
         | they're familiar with that isn't implemented yet).
         | 
         | Also, the language pills differentiate between those that have
         | been implemented (color logo, dark text, bold) and those that
         | aren't (grayscale).
        
           | 4kimov wrote:
           | Good points. Those pages also contain links to old
           | implementations (Hashids), because a lot of projects still
           | use those and want to be able to find them.
        
           | alas44 wrote:
           | Also can help track which languages people click on, probably
           | a good proxy of where there would be the need to develop a
           | lib
        
         | vyrotek wrote:
         | The approach definitely works. Some time ago I saw .NET listed
         | but discovered it wasn't complete. I was eager to replace an
         | existing Hashids implementation so I made some comments, shared
         | a starter-snippet, and then someone was excited enough to
         | complete it in just a few days. It was great to see how quickly the
         | community stepped in. Maybe there was a bit of Cunningham's Law
         | in effect with my contribution, ha.
         | 
         | https://github.com/sqids/sqids-dotnet/issues/2#issuecomment-...
        
       | c2xlZXB5Cg1 wrote:
       | Reminds me of proquints https://github.com/dsw/proquint
       | 
       | But 127.0.0.1 looks more "readable" to me than lusab-babad
        
       | whalesalad wrote:
       | This used to have a totally different name iirc, they used to be
       | called hashids
        
         | resoluteteeth wrote:
         | Yeah, it says that both in the page title and the logo at the
         | upper left
        
       | no_wizard wrote:
       | I like the idea, though I use nanoid with the safe letter
       | dictionary (it excludes letters used for profanity[0])
       | 
       | They should use a similar dictionary approach IMO because I
       | looked at the implementation and it's hardcoded to look for "bad"
       | words
       | 
       | Otherwise looks real straightforward! I'd love to see some
       | performance test suites for it
       | 
       | [0]: https://github.com/sqids/sqids-javascript/blob/ebca95e114932...
       | 
       | [1]: though with UUID v4 so common to generate and well optimized
       | in most languages I wonder if these userland solutions are really
       | better. You can always generate a UUID and re-encode with base32
       | or base64, which is also well optimized in most languages
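A minimal sketch of the "generate a UUID and re-encode it" idea from the comment above, using only the Python standard library. The function names are hypothetical. Note that the result is not profanity-filtered, and the URL-safe base64 alphabet includes `-` and `_`.

```python
import base64
import uuid


def short_uuid_b64(u: uuid.UUID) -> str:
    """URL-safe base64 of the 16 raw bytes, '=' padding removed (22 chars)."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def short_uuid_b32(u: uuid.UUID) -> str:
    """Lowercase base32 of the 16 raw bytes, '=' padding removed (26 chars)."""
    return base64.b32encode(u.bytes).rstrip(b"=").decode("ascii").lower()


def parse_b64(s: str) -> uuid.UUID:
    """Inverse of short_uuid_b64: restore the padding, then decode."""
    padded = s + "=" * (-len(s) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


u = uuid.uuid4()
print(u, short_uuid_b64(u), short_uuid_b32(u))
```

Compared with the 36-character canonical form, this saves 14 characters (base64) or 10 characters (base32) while staying reversible.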
        
         | lxgr wrote:
         | > it excludes letters used for profanity
         | 
         | That doesn't seem possible. How would that work?
         | 
         | > I looked at the implementation and it's hardcoded to look for
         | "bad" words.
         | 
         | If you mean https://github.com/y-gagar1n/nanoid-good, that
         | seems to be doing the same thing.
         | 
         | In general, I'm a bit wary of solutions that "guarantee no bad
         | words" - this is usually highly language-specific: One
         | language's perfectly acceptable name is another language's
         | swear word.
        
           | no_wizard wrote:
           | This is the implementation:
           | https://github.com/CyberAP/nanoid-dictionary
           | 
           | We use it in a highly internationalized product spanning
           | multiple languages and haven't yet run into a complaint or
           | value on audit that would constitute something offensive in any
           | language per our intl content teams anyway.
           | 
           | That isn't to say it's 100% (and simply enough we don't audit
           | every single URL) but I suspect we would have gotten at least
           | a user heads up by now
           | 
           | Nevertheless, we are moving our approach to UUIDs that get
           | base32 encoded for some of our use cases. They're
           | easier for us to work with in many scenarios
        
           | Sharlin wrote:
           | Omit vowels and you're 90% of the way there; omit the vowel-
           | looking digits 0,1,3,4 and you're probably >99% of the way
           | there.
        
             | gberger wrote:
             | fxck
        
               | Sharlin wrote:
               | Which is, evidently, why nanoids also excludes x and X,
               | as well as v and V (fvck).
        
           | Silasdev wrote:
           | It's particularly funny because their example docs for .NET
           | outputs "B4aajs", which to any Swedish l33t speaking
           | individual, would read "Bajs", which means "shit"
        
           | livrem wrote:
           | Looks like the dictionaries used are from this file?
           | 
           | https://registry.npmjs.org/naughty-words/-/naughty-words-1.2...
           | 
           | From a quick look, the lists are pretty short, except for the
           | one with English words, which at least has some 404 words, but
           | I can imagine there are far more bad words that you want to
           | avoid than just those?
        
             | ape4 wrote:
             | Here's the C++ of the sqid blocked words
             | https://github.com/sqids/sqids-cpp/blob/main/include/sqids/b...
        
           | njharman wrote:
           | > That doesn't seem possible. How would that work?
           | 
           | agree; b00b, DlCK, cntfcker
           | 
           | But I suppose, if user doesn't get to craft input, the
           | collision space of converted numerical ids and words like
           | above is sufficiently small to be ignorable.
        
             | Sharlin wrote:
             | Besides vowels, nanoid excludes 0, 1, 3, 4, 5, I, l, x, X,
             | v, V, and other lookalikes, so the chances of generating
             | something naughty in _any_ language are close to zero.
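A rough sketch of the restricted-alphabet approach described in this subthread. The alphabet below is a hypothetical one built from the exclusions listed above (vowels, y, the digits 0/1/3/4/5, and l, x, X, v, V); it is not copied from nanoid-dictionary.

```python
import secrets

# Hypothetical alphabet for illustration only: 5 digits, 17 lowercase and
# 18 uppercase consonants. Vowels, y/Y, l, x/X, v/V and 0/1/3/4/5 are left out.
SAFE_ALPHABET = "26789" + "bcdfghjkmnpqrstwz" + "BCDFGHJKLMNPQRSTWZ"


def random_id(length: int = 12) -> str:
    """Return a random ID drawn uniformly from SAFE_ALPHABET."""
    return "".join(secrets.choice(SAFE_ALPHABET) for _ in range(length))


print(random_id())
```

With 40 symbols each character carries a little over 5 bits, so IDs need to be somewhat longer than base62 ones for the same collision resistance. Unlike Sqids, these IDs are random rather than an encoding of a number, so they have to be stored.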
        
         | tttp wrote:
         | I tried something similar with a fixed alphabet that guarantees
         | no profanity and a checksum (luhn)
         | 
         | https://github.com/tttp/dxid
        
       | dumbo-octopus wrote:
       | Odd design decision in that if you provide your own blocklist, it
       | overwrites their (extensive) default list instead of adding to
       | it.
       | 
       | And in general the algorithm is surprisingly complicated for
       | something that could be replaced with simply base64 encoding, the
       | given example (1,2,3) base64 encodes to a string with just one
       | more letter than this algorithm.
       | 
       | That said I do appreciate the semicolon-free-style. I don't
       | typically see that in libs besides my own.
       | 
       | https://github.com/sqids/sqids-javascript/blob/main/src/sqid...
        
         | 8organicbits wrote:
         | The problem is their block list will change over time. If you
         | don't override it, then your IDs won't decode right when you
         | update. This is a huge risk.
         | 
         | > You have to account for scenarios where a new word might be
         | introduced to the default blocklist
         | 
         | https://sqids.org/faq#future-blocklist
         | 
         | Honestly, I think they need to rethink this. Otherwise you've
         | got different library versions for different languages each
         | using different default blocklists, none of which are
         | compatible.
        
       | jsf01 wrote:
       | What's the use case for passing in an array of numbers? Typically
       | when generating an ID my input is either a single random number,
       | a string that's being hashed, or nothing at all.
        
         | 4kimov wrote:
         | [shard_number, primary_id_number, timestamp]
        
           | dumbo-octopus wrote:
           | But then why not just arbitrary text?
        
       | James_K wrote:
       | I guess they haven't heard of base-64.
        
         | xjia wrote:
         | Or base58, e.g.
         | https://api.rubyonrails.org/classes/SecureRandom.html#method...
        
         | dymk wrote:
         | That doesn't solve the same set of problems as TFA. Randomized
         | output order for sequential input, skips IDs that include
         | profanity.
        
         | majkinetor wrote:
         | One of the points is also to use custom alphabet.
        
       | canU4 wrote:
       | Sad that it is not for user ids
        
         | 8organicbits wrote:
         | I think that's only if you don't want to leak user count when
         | your ID is an autoincrement. Elsewhere people mention
         | cryptographically remapping integers, which could work (by
         | itself, or before passing the ID to sqids).
        
       | packetlost wrote:
       | The name (but not function) seems really close to squuids from
       | Datomic/Clojure.
        
       | 3cats-in-a-coat wrote:
       | I don't get it, that's like two lines of code, why does it have a
       | library and even a domain
        
         | k2xl wrote:
         | Also wondering this
        
       | its-summertime wrote:
       | For a similar thing, (X bytes to X bytes, no collisions)
       | https://en.wikipedia.org/wiki/Format-preserving_encryption is a
       | good page
        
         | jchook wrote:
         | Also see Knuth Hash and k-dimensional equidistribution.
        
       | habitue wrote:
       | Skipping profanity seems like a liability in this design. It
       | means in order to preserve the encoding you need to make the
       | banned word list immutable, otherwise old sqids will decode to
       | the wrong thing when you get them back.
        
         | Etheryte wrote:
         | I don't think this holds, you can enforce filtering in the
         | encoding step, i.e. be strict about what you output, but always
         | decode, even if the input is profanity. This means you can also
         | be backwards compatible if you update the list etc. So in
         | short, the old maxim of be strict about your outputs and
         | lenient about your inputs.
        
           | fimdomeio wrote:
           | From their FAQ: "The best way to ensure your IDs stay
           | consistent throughout future updates is to provide a custom
           | blocklist, even if it is identical to the current default
           | blocklist."
        
             | Etheryte wrote:
             | In that case it sounds like a shortcoming on their part.
             | There is no fundamental reason to have that limitation. I
             | understand it can make the implementation easier to not
             | have it, but in my opinion being blocklist change agnostic
             | would be a much better value offering.
        
             | lights0123 wrote:
             | The *encoding* changes. The decoding stays consistent:
             | 
             | > Decoding IDs will usually produce some kind of numeric
             | output, but that doesn't necessarily mean that the ID is
             | canonical. To check that the ID is valid, you can re-encode
             | decoded numbers and check that the ID matches.
             | 
             | The reason this is not done automatically is that if the
             | default blocklist changes in the future, we don't want to
             | automatically invalidate the ID that has been generated in
             | the past and might now be matching a new blocklist word.
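The "re-encode and compare" check quoted above can be sketched with a toy codec. Base 36 stands in for Sqids here purely for illustration; `encode`, `decode` and `is_canonical` are hypothetical names, not the Sqids API.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def encode(n: int) -> str:
    """Strict encoder: exactly one canonical string per non-negative integer."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(ALPHABET[r])
    return "".join(reversed(out))


def decode(s: str) -> int:
    """Lenient decoder: int() also accepts upper case and leading zeros."""
    return int(s, 36)


def is_canonical(s: str) -> bool:
    """True only if s is exactly what encode() would have produced."""
    try:
        return encode(decode(s)) == s
    except ValueError:
        return False


print(decode("KF12OI"), is_canonical("KF12OI"), is_canonical("kf12oi"))
```

The decoder stays lenient, so old IDs keep resolving, while callers that care about uniqueness of the string form can reject non-canonical input.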
        
         | runlevel1 wrote:
         | The stupid simple way I did this ages ago was:
         | 
         | 1. Start with a-z.
         | 
         | 2. Drop all vowels, numbers, most homoglyphs, and the letter
         | 'x'.
         | 
         | 3. Map digits 0-9 to one of the remaining letters.
         | 
         | 4. Stringify the integer and replace the digit in each decimal
         | place with its corresponding character.
         | 
         | For my use-case, all the numbers were >7 digits long, so the
         | odds of you getting an offensive acronym were reasonably low
         | unless you started combining them.
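The four steps above can be sketched as follows. The particular ten letters are a hypothetical choice (consonants only, no 'l', no 'x'); the comment does not say which letters were actually used.

```python
# Hypothetical digit-to-letter table: 0->b, 1->c, 2->d, 3->f, 4->g,
# 5->h, 6->j, 7->k, 8->m, 9->n.
LETTERS = "bcdfghjkmn"
TO_LETTERS = str.maketrans("0123456789", LETTERS)
TO_DIGITS = str.maketrans(LETTERS, "0123456789")


def digits_to_letters(n: int) -> str:
    """Stringify n and replace each decimal digit with its letter."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return str(n).translate(TO_LETTERS)


def letters_to_digits(s: str) -> int:
    """Inverse mapping; rejects anything outside the ten letters."""
    if not s or any(c not in LETTERS for c in s):
        raise ValueError("not a valid ID")
    return int(s.translate(TO_DIGITS))


print(digits_to_letters(12345678))  # cdfghjkm
```

The output is as long as the decimal number itself, so this trades compactness for simplicity.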
         | 
         | But there's no perfect solution. As this dataset shows, you can
         | find offense in almost anything if you look hard enough:
         | 
         | California Personalized License Plate Requests Flagged for
         | Review 2015-2016:
         | https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN...
        
           | arp242 wrote:
           | Many of those reviewer comments are utterly moronic. And that
           | is my _polite_ opinion.
           | 
           | How does this work? Is there a review board? Is it put to
           | public review? A few of them like "dick out" and "shtlord"
           | are reasonable, but many of them seem so bonkers it looks
           | like the work of trolls.
           | 
           | Anyway, TIL that 1970s Intel was a MS-13 gang outfit and that
           | Octocat really means "eight vaginas".
        
           | air7 wrote:
           | > California Personalized License Plate Requests Flagged for
           | Review 2015-2016:
           | https://docs.google.com/spreadsheets/d/18IUVU9Q4uN_lxqNd5AsN...
           | 
           | Wow this is a funny peek into a weird predicament where people
           | need to justify that they have a good reason to have a
           | specific license plate.
           | 
           | Some seem obviously ok such as:
           | 
           | INT13H
           | 
           | 314 PI
           | 
           | And some are obviously not:
           | 
           | DRY(hand emoji)JOB
           | 
           | DICK OUT
           | 
           | Come to think of it: Can license plates have emojis now?!
        
         | 8organicbits wrote:
         | Agreed, this is a big risk, made worse by the fact that the default
         | word list can change over time.
         | 
         | https://sqids.org/faq#future-blocklist
        
         | kaetemi wrote:
         | It's a base62 encoder that takes multiple integers as input.
         | Probably a bit-length prefixed encoding. I am assuming it just
         | pads an extra junk integer to re-roll the encoded number.
        
       | 8organicbits wrote:
       | The mention of one-time passcodes seems odd. Those need to be
       | unguessable, but don't need to be unique. If you supply a
       | suitable random source, then I suppose it works, but the "padded
       | with junk" feature makes these look more complex than they really
       | are.
       | 
       | The standard choice of 4 to 8 random digits works well and it's
       | clear what level of security they provide. Digits are easier to
       | understand than case sensitive latin characters, especially when
       | your native language uses a different character set.
        
       | progne wrote:
       | In a Ruby app we just convert to a high base, like
       | > 1234567890.to_s(36)       => "kf12oi"
       | 
       | That gets us most of the way there, but Sqid has a Ruby library
       | and lets you set a much higher base, including upper case
       | characters, and I suppose, emoji. We're going to need much bigger
       | numbers before that space savings makes much difference. I like
       | it, but it's hard to know when something like that is worth
       | adding a dependency.
        
         | vyrotek wrote:
         | I believe a big part of the idea is for the hash to be
         | unpredictable as well.
         | 
         | If I figure out you're using (36) then I know the next number
         | 1234567891 is "kf12oj".
         | 
         | Not the case with Sqids.
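For illustration, a Python equivalent of Ruby's `to_s(36)` showing the predictability described above: consecutive integers differ only in the last character. `to_base36` is a hypothetical helper name.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Plain positional base-36 encoding of a non-negative integer."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out))


for n in (1234567890, 1234567891, 1234567892):
    print(n, to_base36(n))
# 1234567890 kf12oi
# 1234567891 kf12oj
# 1234567892 kf12ok
```

How much harder Sqids makes this is disputed in the replies below.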
        
           | hot_gril wrote:
           | You can easily brute-force this. Sqids also says it's not
           | good for sensitive data.
        
             | 8organicbits wrote:
             | It looks like an easy brute force too, there's no compute-
             | hard operations here. I guess you could scramble your
             | alphabet? Otherwise Uk always comes after bM, etc.
        
           | echelon wrote:
           | I'd prefer to use crockford-encoded entropy with Stripe-style
           | token prefixes to create unique ID namespaces. Run it through
           | a bad words filter, and it's perfect.
           | 
           | user_1hrpt0xpax7ps
           | 
           | file_xpax7psaz0tv6az0tv6
           | 
           | Etc.
           | 
           | In distributed systems you can use the trailing bytes to
           | encode things like author cluster, in case you're active-
           | active and need to route subsequent writes before create
           | event replication.
           | 
           | Easy to copy, debug, run ops/oncall against. If you have an
           | API, they're user-friendly.
           | 
           | Of course you still want to instruct people the prefixes are
           | opaque.
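A minimal sketch of prefixed, Crockford-encoded random tokens as described above. The alphabet is Crockford's base32 (no i, l, o, u), lowercased here to match the examples in the comment. `new_token` and the 10-byte default are assumptions, and the bad-words filter is left out.

```python
import secrets

CROCKFORD = "0123456789abcdefghjkmnpqrstvwxyz"  # 32 symbols, no i, l, o, u


def crockford_encode(data: bytes) -> str:
    """Encode bytes as fixed-width Crockford base32 (5 bits per character)."""
    n = int.from_bytes(data, "big")
    length = (len(data) * 8 + 4) // 5  # ceil(bits / 5)
    chars = []
    for _ in range(length):
        n, r = divmod(n, 32)
        chars.append(CROCKFORD[r])
    return "".join(reversed(chars))


def new_token(prefix: str, nbytes: int = 10) -> str:
    """Return e.g. 'user_<16 chars>' built from nbytes of OS randomness."""
    return f"{prefix}_{crockford_encode(secrets.token_bytes(nbytes))}"


print(new_token("user"), new_token("file"))
```

Because the payload is random, these tokens have to be stored alongside the record; they are not a reversible encoding of a database ID.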
        
             | wombatpm wrote:
             | Yeah don't forget the bad words filter. I worked on an IKEA
             | mailing where the list processing house was adding an
             | autogenerated discount code to the address label. The
             | customers received codes with BOOB, DICK, TWAT, and CUNT
             | embedded within. People were not happy.
        
               | otteromkram wrote:
               | Did they never make an IKEA purchase after that or did
               | they get over it like a normal adult?
               | 
               | I don't work retail, but something tells me people will
               | make a stink out of just about anything if it meant
               | potentially free products or other compensation.
               | 
               | Plus, are you filtering just English curse words or all
               | curse words for countries that use Latin characters?
        
           | pelagicAustral wrote:
           | Correct me if I'm wrong, but, It cannot be unpredictable,
           | which makes the library redundant for security concerns,
           | which would be the one business case to seek for anything
           | other than a UUID (which is already built into Ruby).
        
           | paulddraper wrote:
           | No, squids are predictable
        
         | candiddevmike wrote:
         | BaseEmoji is a thing: https://github.com/amoallim15/base-emoji
        
         | exxos wrote:
         | I didn't think of that, but this is a nice trick!
        
       | 1-6 wrote:
       | Sqids vs Squids. Missing the 'U' for unique but nevertheless a
       | unique shortened version of the regular spelling.
        
       | urza wrote:
       | I wanted to say that I use similar project called HashIDs, but I
       | see that HashIDs rebranded to Sqids :)
        
       | ComputerGuru wrote:
       | I haven't been able to find a case for this because ids either
       | need to be unique or they're not going to be large. If they're
       | unique, I'm using uuid or ulid (uuidv7 of tomorrow) as the
       | sortable primary key type to avoid conflicts without using the db
       | to generate and maintain sequences.
       | 
       | Where do you have unique ids that aren't the primary key? I would
       | be more interested in a retrospectively unique truncated encoding
       | for extant ulid/uuid; ie given that we've passed timestamp foo,
       | we know that (where no external data is merged) we only need a
       | bucketed time granularity of x for the random component of the id
       | to remain unique (for when sortability is no longer needed).
       | 
       | Or just more generally a way to convert a ulid/uuidv7 to a
       | shorter sequence if we are using it for external hash table
       | lookups only and can do without the timestamp component.
        
         | Bytewave81 wrote:
         | The idea is that you encode and decode database IDs with this.
         | You wouldn't save them separately unless you were using it for
         | a purpose other than shareable "identifiers" which don't leak
         | significant amounts of database state. Imagine something like a
         | link shortener where you want to provide a short link to users,
         | but don't want it to just be a number.
        
         | swyx wrote:
         | why is ulid the uuidv7 of tomorrow?
        
       | waffle_ss wrote:
       | I wrote a Ruby gem to address this problem of hiding sequential
       | primary keys that uses a Feistel network to effectively shuffle
       | int64 IDs: https://github.com/abevoelker/gfc64
       | 
       | So instead of
       | 
       |     /customers/1
       |     /customers/2
       | 
       | You'll get something like
       | 
       |     /customers/4552956331295818987
       |     /customers/3833777695217202560
       | 
       | Kinda similar idea to this library but you're encoding from an
       | integer to another integer (i.e. it's format-preserving
       | encryption). I like keeping the IDs as integers without having to
       | reach for e.g. UUIDs
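A toy sketch of the general technique (a balanced Feistel network over 64-bit integers), not the gem's actual construction. The round function, key handling and round count are assumptions chosen for brevity, and this has not been reviewed as cryptography.

```python
import hashlib

MASK32 = 0xFFFFFFFF


def _round(half: int, key: bytes, i: int) -> int:
    """Round function: 32 bits from a keyed BLAKE2b hash of (half, round index)."""
    data = half.to_bytes(4, "big") + bytes([i])
    return int.from_bytes(hashlib.blake2b(data, key=key, digest_size=4).digest(), "big")


def feistel_encrypt(n: int, key: bytes, rounds: int = 4) -> int:
    """Map an unsigned 64-bit integer to another one, bijectively."""
    if not 0 <= n < 2**64:
        raise ValueError("n must fit in 64 bits")
    left, right = n >> 32, n & MASK32
    for i in range(rounds):
        left, right = right, left ^ _round(right, key, i)
    return (left << 32) | right


def feistel_decrypt(n: int, key: bytes, rounds: int = 4) -> int:
    """Undo feistel_encrypt by running the rounds in reverse order."""
    left, right = n >> 32, n & MASK32
    for i in reversed(range(rounds)):
        left, right = right ^ _round(left, key, i), left
    return (left << 32) | right


key = b"demo key, not a real secret"
for customer_id in (1, 2):
    public_id = feistel_encrypt(customer_id, key)
    print(customer_id, public_id, feistel_decrypt(public_id, key))
```

Outputs span the full unsigned 64-bit range, so a signed bigint column would need a narrower block or an extra mapping step.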
        
       | chupapimunyenyo wrote:
       | Hashids seems way better than their new implementation
        
       | Use wrote:
       | Why should you hide your user count?
        
         | sneak wrote:
         | The rate of change over time can be used against you; many
         | people consider their businesses' month-over-month growth (or
         | lack thereof) to be private information.
         | 
         | "$WEBSITE did 50,000 signups a month during the beginning of
         | the pandemic, but now struggles to sign up a thousand a week"
         | is a story.
        
       | Schnitz wrote:
       | It would be great to have a quick primer on why this is better
       | than what people typically homebrew, like base62 encoding a
       | random number.
        
         | sneak wrote:
         | Database PKs usually aren't random, which AFAIK is what is
         | usually used as the number in this case.
        
         | 8organicbits wrote:
         | If you use a random number then you need to store it somewhere
         | to map back to the original. Sqids is an encoding, you can
         | decode the sqid back to the original without storage overhead.
         | 
         | Features like the profanity filter avoid creating URL routes
         | like /user/cuntFh.
         | 
         | Cross language support allows interop between the encoder and
         | decoder across microservices written in different languages.
        
       | parhamn wrote:
       | Side note: there are some business insights you can get from a
       | company using serial ids.
       | 
       | e.g. if you sign up and get user id 32588 and make another account
       | a few days later, you can tell the growth rate of the company.
       | 
       | And this is possible with every resource type in the application.
       | 
       | I do wonder how much the url bar junk thing matters these days. I
       | tend to use ULIDs (waiting on uuid v7 wide adoption), and
       | they're a bit ugly, but most browsers hide most of the urls now
       | anyway. The fact that there is a builtin time component comes in
       | clutch sometimes (e.g. object merging rules).
        
         | pacificmint wrote:
         | > you can tell the growth rate of the company.
         | 
         | You can even do this when you don't know the exact interval by
         | using probabilities. The Allies used this method to estimate
         | German tank production in World War II by analyzing the serial
         | numbers of captured or destroyed tanks.
         | 
         | This is known as the German Tank Problem [1]
         | 
         | [1] https://en.wikipedia.org/wiki/German_tank_problem
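As a worked example, the standard frequentist estimate for this problem is N ≈ m + m/k - 1, where m is the largest serial number observed and k is the number of distinct samples. The sketch below assumes serials start at 1 and are sampled uniformly without replacement; `estimate_total` is a hypothetical name.

```python
def estimate_total(serials):
    """Estimate the population size from observed serial numbers."""
    distinct = set(serials)
    if not distinct:
        raise ValueError("need at least one observation")
    k = len(distinct)   # number of distinct samples
    m = max(distinct)   # largest serial seen
    return m + m / k - 1


print(estimate_total([19, 40, 42, 60]))  # 74.0
```

Applied to sequential user IDs, a handful of observed IDs is enough for a usable estimate of the total.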
        
           | lhamil64 wrote:
           | It also makes it slightly easier to perform certain attacks
           | since it's trivial to figure out other IDs.
        
         | paulddraper wrote:
         | > most browsers
         | 
         | Not chrome...
         | 
         | Also, links are a thing in chat, etc
        
           | parhamn wrote:
           | Here's what a recent YouTube (which Sqids documents as a
           | sample use case) link I shared looked like:
           | 
           | > https://www.youtube.com/watch?v=fFMzQ3tYTFU&pp=ygURY2hQImVzZ...
           | 
           | Or Twitter:
           | 
           | > https://x.com/elonmusk/status/172853302828286055507?s=20
           | 
           | Or TikTok:
           | 
           | > https://www.tiktok.com/@<userId>/video/730292574259232054785...
           | 
           | While I tend to strip the tracking params and there are
           | extensions that do this, I don't think most people do. These
           | URLs are pretty 'ugly'.
           | 
           | So if the links that are being shared most on the internet
           | (YT, TikTok, Twitter) don't care, you probably shouldn't
           | either. I think the onus is on the UI layers (Chat apps, etc)
           | to show urls how they look best on their respective
           | platforms.
           | 
           | Edit: to this point, it looks like HN truncates these to make
           | them less ugly too.
        
       | swyx wrote:
       | saving it to my list of uid implementations
       | https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...
        
       | bufferoverflow wrote:
       | Do we really need a library for that? Shouldn't it be a simple
       | function?
        
       | dustingetz wrote:
       | anyone have a copy pasta for the widest possible alphabet (i.e.
       | extended unicode safe chars)
        
       | filleokus wrote:
       | > Not Good For:
       | 
       | > User IDs - Can be decoded, revealing user count
       | 
       | Suppose you don't want to leak the count, what's a reasonable way
       | of implementing that?
       | 
       | You can of course have a UUID v7 / ULID or something as the
       | primary key. Or have it as a public facing primary key, mapping
       | back to a sequential ID PK (there might be some performance hits
       | with larger PKs in e.g. Postgres? Or is that just FUD?)
       | 
       | But you could also generate a public ID with something like
       | encrypt(seq_id, secret) and then encode it with whatever alphabet
       | and or profanity filter you'd like - right? The issue then is
       | that all public ID's would be long (and of course dealing with a
       | decrypt operation on all incoming requests).
       | 
       | Don't know what's best really.
        
         | 8n4vidtmkvmk wrote:
         | Add an offset, multiply by a large prime number, and modulo. I
         | don't think you can recover the original number without
         | figuring out the prime.
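A sketch of the offset-multiply-modulo idea, with hypothetical constants. The multiplier only needs to be coprime with the modulus to be reversible. One caveat is visible in the last lines: the outputs for two consecutive inputs differ by exactly the multiplier, so anyone who can create two accounts back to back can recover it.

```python
MODULUS = 2**32
MULTIPLIER = 2654435761  # hypothetical; any odd value is invertible mod 2**32
OFFSET = 123456789       # hypothetical secret offset
INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse (Python 3.8+)


def obfuscate(n: int) -> int:
    """Affine map n -> (n + OFFSET) * MULTIPLIER mod MODULUS."""
    if not 0 <= n < MODULUS:
        raise ValueError("n out of range")
    return ((n + OFFSET) * MULTIPLIER) % MODULUS


def deobfuscate(x: int) -> int:
    """Inverse map: multiply by the modular inverse, then remove the offset."""
    return (x * INVERSE - OFFSET) % MODULUS


print(obfuscate(1000), obfuscate(1001), deobfuscate(obfuscate(1000)))

# The difference between consecutive outputs is the multiplier itself.
leak = (obfuscate(1001) - obfuscate(1000)) % MODULUS
print(leak == MULTIPLIER)  # True
```

This hides the raw counter from casual inspection only; keyed schemes such as format-preserving encryption are the stronger option mentioned elsewhere in the thread.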
        
       | kryptogeist wrote:
       | Damn, those squids are getting smart
        
       | orf wrote:
       | How do you adjust or evolve the blocklist with this, without
       | making previously generated IDs incorrect?
       | 
       | The ID is simply incremented if it is blacklisted [1]. So the ID
       | is fixed to the blacklist content, and adjusting it in any way
       | invalidates certain segments of previously generated IDs?
       | 
       | 1. https://github.com/sqids/sqids-rust/blob/9f987886bc06875d782...
        
       | revenga99 wrote:
       | Is there any way to generate short unique IDs from UUIDs?
       | Snowflake is incredibly slow when joining UUID => UUID columns.
        
       | andrewstuart wrote:
       | What is the decimal range of these values?
        
       | sandstrom wrote:
       | Neat library!
       | 
       | We're using randomly generated strings for many things. IDs,
       | password recovery tokens, etc. We've generated millions of them
       | in our system, for various use-cases. Hundreds of thousands of
       | people see them every day.
       | 
       | I've never heard any complaints about a random content-id being
       | "lR8vDick4r" (dick) or whatever.
       | 
       | But nowadays our society is so afraid of offending anyone, that
       | profanity filters have extended all the way to database IDs and
       | password recovery tokens.
       | 
       | (there are some legit cases, like randomly generated IDs for user
       | profiles shared in public URLs, that users have to live with, but
       | even there just make the min length 8 and you're unlikely to have
       | any full-word profanity as the complete ID; put differently, I
       | don't understand why they made the block list an opt-out thing)
        
       ___________________________________________________________________
       (page generated 2023-11-25 23:00 UTC)