hngopher.com

       [HN Gopher] Show HN: Using stylometry to find HN users with alte...
       ___________________________________________________________________
        
       Show HN: Using stylometry to find HN users with alternate accounts
        
       Author here. This site lets you put in a username and get the users
       with the most similar writing style to that user. It confirmed
       several users who I suspected were alts and after informally asking
       around has identified abandoned accounts of people I know from many
       years ago. I made this site mostly to show how easy this is and how
       it can erode online privacy. If some guy with a little bit of
       Python, and $8 to rent a decent dedicated server for a day can make
       this, imagine what a company with millions of dollars and a couple
       dozen PhD linguists could do.  Here's Paul Graham:
       https://stylometry.net/user?username=pg  Here are some frequent HN
       commenters: (EDIT: Removed due to privacy concerns)
        
       Author : costco
       Score  : 394 points
       Date   : 2022-11-26 18:03 UTC (4 hours ago)
        
 (HTM) web link (stylometry.net)
 (TXT) w3m dump (stylometry.net)
        
       | oblib wrote:
       | I've only had one account here. The highest match has a 0.624
       | score and the lowest a 0.572. I'm not sure if that means I'm
       | unique or common but I'd like to know.
        
       | macintux wrote:
       | My nearest match is only at 0.406. It'd be interesting to see who
       | the most unique commenters are, but it's also quite possible it
       | wouldn't be flattering.
        
         | joisig wrote:
         | 0.2506 is my nearest match
        
         | pubby wrote:
         | 0.35 is my nearest. In hopes of lowering it even further, here
         | are some nonsensical opinions never expressed on HN before: 1)
         | Programming peaked with COBOL 2) Paul Graham is responsible for
         | 90% of SIDS cases 3) There's no reason to use car when cdr
         | exists.
        
       | seydor wrote:
       | Well, the only solution is to have so many alts that nobody
       | would believe you could possibly have that many.
        
       | WalterBright wrote:
       | Over in the D language forums, we welcome people who post under a
       | pseudonym, and our policy is we won't allow attempts to unmask
       | them.
       | 
       | This is to protect high profile users who are secretly enjoying
       | programming in D rather than the language they are supposed to
       | use.
       | 
       | And, of course, to protect users who feel they might be
       | discriminated against if their background was known.
        
         | bo1024 wrote:
         | It's very important for those people to be aware of these style
         | analysis attacks! Glad this post is raising awareness.
        
         | [deleted]
        
       | [deleted]
        
       | xwolfi wrote:
       | Wow... how !
        
       | notduncansmith wrote:
       | This has been a great way to find people whose commentary I
       | enjoy!
        
       | dsr_ wrote:
       | This is interesting.
       | 
       | I'm 0.566 correlated with logfromblammo -- and while we are
       | definitely not the same person, I could easily imagine writing a
       | sentence such as:
       | 
       | "For some bizarre reason, management has not yet assigned a task
       | to their programmer underlings to automated themselves out of
       | existence. I can't imagine why."
       | 
       | which is theirs, not mine, from about a year ago. I like that.
       | 
       | On the other hand, I'm nearly as correlated with peterwwillis:
       | 0.5485 -- who has no comments and no submissions.
        
         | costco wrote:
         | > On the other hand, I'm nearly as correlated with
         | peterwwillis: 0.5485 -- who has no comments and no submissions.
         | 
         | This is due to the Firebase API not updating when users ask the
         | admins to move their comments to another account.
        
         | lifeisstillgood wrote:
         | I had a similar experience finding my most likely alt (.50
         | suggesting I am a unique snowflake as I have always thought
         | :-), my most likely alt is writing certainly in a style I
         | appreciate and on subjects I often mention.
        
       | culi wrote:
       | Similar to how they make adversarial fashion[0][1] in order to
       | not be tracked by face id AI, I wonder if we can make adversarial
       | stylometry tools to run your comments through in order to
       | anonymize it
       | 
       | .. [0] https://hackaday.com/2022/10/20/render-yourself-invisible-
       | to...
       | 
       | .. [1] https://adversarialfashion.com/
        
         | carewell wrote:
         | OP links to a paraphrasing tool on their website.
        
       | pugworthy wrote:
       | Strip leading/trailing white space from the name if it says no
       | match.
        
       | robertlagrant wrote:
       | Clicking on my top match (0.61) - I can see the similarity. I
       | also note they quote the same way, with a > symbol. I wonder if
       | that helps!
        
       | lostmyacctoops wrote:
       | I'd be _very_ curious to know if these algorithms can link very
       | different _types_ of text. I'm not surprised that my style is
       | "derivable" on HN, but what if you included my slash-fic pieces,
       | my research papers, etc, would it still "catch" me?
       | 
       | Also, talk about a chilling effect. I was already vaguely aware
       | of this, and now I'm overthinking every word I'm thinking/typing.
        
       | uberduper wrote:
       | I would have expected to be a closer match to myself.
       | 
       | > uberduper: 0.9999999999999991
        
       | birdyrooster wrote:
        
         | sdwr wrote:
         | Ooooeeoo oo oooo. Ooooeeooo oooo<<barbra striesand>>
        
           | wizzwizz4 wrote:
           | Yes, sadly. In this case, it'd be an arsehole move, but good
           | point.
        
         | phnofive wrote:
         | If you want to ask HN to remove your data, send a message to
         | hn@ycombinator.com.
        
         | CharlesW wrote:
         | Not to diminish one bit how you're feeling, but the bright side
         | is: Today you know this is easily done (information you didn't
         | have yesterday), that the creator had no intention of "outing"
         | you specifically, and that you can take steps to obfuscate this
         | specific aspect of your posts that connects your public alts.
        
       | dibt wrote:
       | Since it looks for similar word usage, false positives seem to
       | appear more often when specific topics are talked about, like
       | stocks or crypto.
       | 
       | Does this ignore stop words? Or do all words have the same
       | weighting? I wonder if only focusing on stop words would give a
       | more accurate measure. Maybe we are more comfortable with certain
       | stop words more than others?
       | 
       | https://en.wikipedia.org/wiki/Stop_words
       | 
       | "Stop words are the words in a stop list (or stoplist or negative
       | dictionary) which are filtered out (i.e. stopped) before or after
       | processing of natural language data (text) because they are
       | insignificant."
        
         | costco wrote:
         | All words have the same weighting. I don't ignore stop words,
         | in fact most of the ngrams I use are composed almost
         | entirely of stop words. Maybe it'd be more effective if I
         | ignored them.
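
A minimal sketch of the kind of calculation described here, not the site's
actual code. It assumes plain unigrams rather than the ngrams the author
mentions, a tiny hypothetical function-word list (FUNCTION_WORDS), and cosine
similarity as the score, which the thread suggests but does not confirm.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list; a real list would be much longer.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "i"]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def profile(text, vocab=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word, all words weighted equally."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Usage: compare two short (made-up) comments.
a = profile("I think that it is the best of the options.")
b = profile("The cat sat in a box and it is happy.")
score = cosine(a, b)
```

Because the vocabulary is only stop words, topic words such as "stocks" or
"crypto" do not affect the score in this sketch.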
        
       | afarviral wrote:
       | I'm tempted to use it to find like-minded friends :)
        
       | SnowHill9902 wrote:
       | Anything like this for Reddit?
       | 
       | Would translating to another language and back defend against
       | this algorithm?
        
         | costco wrote:
         | > Anything like this for Reddit?
         | 
         | No but it would be easily adaptable especially given that
         | Pushshift is archiving every Reddit comment. Based on some of
         | the feedback I'm getting here I don't know if I should open
         | source this even though it really wasn't that hard to make.
         | 
         | > Would translating to another language and back defend
         | against this algorithm?
         | 
         | Yes. But then you have to send your original comment to a
         | translation company so there are privacy concerns there too.
        
           | operator-name wrote:
           | I wouldn't worry about that too much as someone's already
           | done something similar for reddit
           | (https://towardsdatascience.com/using-nlp-to-identify-
           | reddito...), and has released their code publicly
           | (https://github.com/jabraunlin/reddit-user-id)
           | 
           | Given the technique used, I don't see why something simple
           | and local wouldn't defeat it? The "easiest" technique would
           | be to use this weighting as a negative metric in rewriting.
        
           | hcs wrote:
           | > But then you have to send your original comment to a
           | translation company so there are privacy concerns there too.
           | 
           | There are modern offline translation systems available such
           | as Bergamot https://browser.mt/
        
           | EMIRELADERO wrote:
           | > Based on some of the feedback I'm getting here I don't know
           | if I should open source this even though it really wasn't
           | that hard to make.
           | 
           | I'd say you should. I'd rather see this as being publicly and
           | freely available to everyone rather than some shady "Big
           | Tech" analytics company.
           | 
           | If the "weapons" exist, I would feel more comfortable knowing
           | everyone can access them, not just an elite that can use it
           | for their own (selfish) purposes.
        
             | A4ET8a8uTh0 wrote:
             | I am genuinely torn, because my initial reaction was almost
             | the exact opposite, but the comparison to a weapon does
             | ring true. And there is indeed an argument to be made for
             | level playing field. At the very least, maybe counter-
             | measures can be developed.
        
               | Terretta wrote:
               | People don't usually understand privacy risks till their
               | own curtains fall down.
        
           | [deleted]
        
       | AtlasBarfed wrote:
       | What's a high correlation number?
        
       | ThrowawayTestr wrote:
       | Haha, you got me and my main account. That's spooky.
        
       | [deleted]
        
       | jonnycomputer wrote:
       | Obviously the next thing to do is make this a popup on someone's
       | account name when you hover over it.
        
       | psychphysic wrote:
       | Hmmm, doesn't seem to work. But you have convinced me (and many
       | others?) to search for our alts consecutively, so now you know
       | who has alts?
        
       | elorant wrote:
       | Sounds like a nice tool to find friends. You locate people who
       | might think like you.
        
       | RepAgent wrote:
       | What's up with cluster of users like:
       | 
       | j_s,password4321,carolinew,colinwright,kuharich etc.
       | 
       | https://stylometry.net/user?username=j_s
       | https://stylometry.net/user?username=carolinew
       | https://stylometry.net/user?username=colinwright
       | https://stylometry.net/user?username=password4321
       | 
       | Lowest match for j_s is 0.80 and all but one is black.
        
         | [deleted]
        
       | saurik wrote:
       | Why are some users bold?
        
         | srean wrote:
         | The non-bold are dead accounts I think
        
           | saurik wrote:
           | It isn't due to a mere property of the user, as, for example,
           | cushman is not bold as the #2 result for tptacek but is bold
           | as the #2 result for icambron.
        
             | srean wrote:
             | Good point.
        
             | stavros wrote:
             | FYI, the GP said above that bold usernames are those for
             | which symmetry holds (ie they're both in each other's top
             | ten).
        
         | costco wrote:
         | Say you see user2 listed in bold on user1's page. That means
         | that user1 is also in user2's top 20 users. In my experience it
         | is often an indicator of a good match (but not always). I
         | should probably explain that on the site.
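
A small sketch of the bold rule as described above, assuming the scores are
held in a nested dict (scores[user][other]); the data layout and the function
names are hypothetical, not taken from the site.

```python
def top_k(scores, user, k=20):
    """Return the k users most similar to `user`, best first, excluding the user."""
    ranked = sorted(
        (other for other in scores[user] if other != user),
        key=lambda other: scores[user][other],
        reverse=True,
    )
    return ranked[:k]


def is_mutual(scores, user1, user2, k=20):
    """True when each user appears in the other's top-k list (shown in bold)."""
    return user2 in top_k(scores, user1, k) and user1 in top_k(scores, user2, k)


# Usage with made-up scores and k=1 to keep the example small.
scores = {
    "a": {"b": 0.9, "c": 0.5},
    "b": {"a": 0.9, "c": 0.8},
    "c": {"a": 0.5, "b": 0.8},
}
bold_on_a = is_mutual(scores, "a", "b", k=1)  # a and b rank each other first
bold_on_c = is_mutual(scores, "c", "b", k=1)  # b is c's top match, but not vice versa
```

The score between two users is the same in both directions, but the ranking
is not, which is why the same account can be bold on one page and not on
another.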
        
           | layer8 wrote:
           | Instead of making it binary, you could use a gradient
           | indicating the strength of the mutual correlation (like how
           | HN colors downvoted comments).
        
       | franze wrote:
       | totally on spot
       | 
       | my current and my old account
        
       | jonnycomputer wrote:
       | Well, one of the closest on my list is my twin, so there's that.
        
       | Retr0id wrote:
       | It didn't find my alt, but the second match is one of my twitter
       | mutuals - I wonder if we've inadvertently borrowed style quirks
       | from each other.
        
         | [deleted]
        
       | hgsgm wrote:
        
       | [deleted]
        
       | samwillis wrote:
       | Sticking myself in (I haven't ever had another account) my
       | closest match (at 0.43) is the maintainer of an Open Source
       | project which I have occasionally commented about. They are also
       | British, as am I.
       | 
       | My guess is that as they commonly mention the project and I have
       | on a number of occasions, that has formed the link. Plus maybe
       | usage of common British terms, but that seems far less
       | significant.
       | 
       | It's super interesting!
       | 
       | It would be good if there were more controls to filter the type
       | of words and language that are used for the matching algorithm.
       | So you could say exclude words not in the dictionary. I wonder
       | how that would affect my link with this other person.
        
       | WaitWaitWha wrote:
       | I checked a few random user names and I am confused.
       | 
       | - Why is the author costco[0] not in this lookup?
       | 
       | [0]: https://stylometry.net/user?username=costco
        
         | [deleted]
        
         | Aachen wrote:
         | - Their first comment and submission were 4 hours ago.
         | 
         | - The text on that page is accurate it seems.
        
       | julienreszka wrote:
       | why is my username not exactly equal to 1?
       | https://stylometry.net/user?username=julienreszka
        
         | costco wrote:
         | Python/floating point rounding error. It doesn't mean anything.
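
A short sketch of why a self-match can land a hair below 1 and how the
display could hide it. It assumes the score is a cosine similarity, which is
not confirmed in the thread; both function names are hypothetical.

```python
import math


def self_similarity(vec):
    """Cosine similarity of a non-zero vector with itself; ideally exactly 1."""
    dot = sum(x * x for x in vec)
    norm = math.sqrt(dot)
    # sqrt and the multiplication each round, so the result may differ
    # from 1.0 in the last few bits.
    return dot / (norm * norm)


def display_score(score, digits=4):
    """Clamp a score to [-1, 1] and round it for display."""
    return round(max(-1.0, min(1.0, score)), digits)


# Usage: compare with a tolerance, and round what is shown to the user.
s = self_similarity([0.1, 0.2, 0.3])
same = math.isclose(s, 1.0)
shown = display_score(0.9999999999999991)  # 1.0
```

Rounding to four digits also addresses the requests elsewhere in the thread
for shorter scores.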
        
       | bhaney wrote:
       | Well now I'm self conscious about my closest match being an 0.34
       | when so many other people are reporting much closer matches with
       | accounts that aren't alts. Do I write weirdly?
        
         | spapas82 wrote:
         | Same for me, the closest match is 0.36. But I expected that
         | because I don't speak english very well so the pool of
         | candidates is small.
        
         | CobaltFire wrote:
         | My closest is 0.40, so I'm right there with you.
         | 
         | Native English speaker as well.
        
         | klohto wrote:
         | 0.36 here! Out of curiosity, are you a native speaker?
        
           | bhaney wrote:
           | I am, yes.
        
             | quink wrote:
             | 0.39 for myself, I'm a non-native speaker.
        
       | stephc_int13 wrote:
       | What is the threshold to be reasonably confident that two
       | accounts are from the same individual?
       | 
       | I have only ever had one account here and the closest match is
       | at 0.47.
        
       | jefftk wrote:
       | Tried my account thinking "I don't have any alts" but it turns
       | out I do! In 2018 I changed my username from "cbr" to "jefftk"
       | and it pulled that right up:
       | https://stylometry.net/user?username=jefftk
        
       | CobaltFire wrote:
       | Interesting; I must have a fairly unique style as there are no
       | matches over 0.40 for me.
       | 
       | I'm a native English speaker as well, so I'm unsure how to feel
       | about that.
        
       | SkyMarshal wrote:
       | Oddly, I am not an exact match to myself.
       | 
       | > Most likely candidates:
       | >
       | > skymarshal: 0.9999999999999997
       | 
       | The other few usernames I tested (pg, dang, some random ones from
       | this thread) all matched themselves at 1.0.
        
       | ChrisMarshallNY wrote:
       | Interesting, but it gave me 20 accounts, and I _know_ that I only
       | have this one.
        
       | notacoward wrote:
       | Seems pretty spot-on to me. I tried it with two accounts I was
       | already certain were alts - based on other factors like favorite
       | topics and common enemies as well as style/tone - and the top
       | hits for both were the ones I would have expected.
        
       | nwiswell wrote:
       | I don't have an alt but it would be cool to meet my stylometry-
       | neighbors. I'm curious whether the writing similarity translates
       | to oral communication too
        
       | kiernanmcgowan wrote:
       | Love a little NLP project on a public dataset - thanks for
       | sharing!
        
         | [deleted]
        
       | jimhi wrote:
       | Amazing and I thought my doxxing tool was terrifying -
       | https://news.ycombinator.com/item?id=32278871
       | 
       | I am afraid to combine all these methods
        
         | lijogdfljk wrote:
         | Yea.. i guess it's time to stop bothering with alt
         | accounts/etc. I'll just make one account, maybe differently
         | named on different services (makes scraping just a _pinch_
         | easier) but aside from that all i can do is modify/remove old
         | posts.
         | 
         | Bit of a shame for useful posts/discussions.. but the internet
         | is getting really.. finger print laden.
        
       | timeon wrote:
       | I had a hard time understanding some comments made by my closest
       | match. I guess this is a good reality check. I need to learn how
       | to write more legible posts now.
        
         | FartyMcFarter wrote:
         | Sorry, what did you mean? :P
        
       | schappim wrote:
       | Interesting that the Op doesn't come up in the search:
       | https://stylometry.net/user?username=costco
        
         | Beltalowda wrote:
         | Not surprising considering the account had no activity before
         | today.
        
         | Aachen wrote:
         | Their first comment and submission were 4 hours ago. Text on
         | the page is accurate it seems.
        
       | 4qz wrote:
       | This is an evil website. We won't have any anonymity soon. The
       | highest match is my years old banned account that I forgot about.
       | Where did you get the data from?
        
         | JadeNB wrote:
         | > This is an evil website. We won't have any anonymity soon.
         | The highest match is my years old banned account that I forgot
         | about. Where did you get the data from?
         | 
         | I'd way rather have someone tell me "look at all the things I
         | can find out about you" so that I can act accordingly (whatever
         | that means!) rather than what we've mostly actually got, which
         | is companies silently exploiting my data and doing everything
         | they can to mumble reassuring but legally ineffective formulas
         | assuring me that they deeply respect my privacy.
        
         | costco wrote:
         | HN Firebase API. I just wrote a program in C++ with libcurl to
         | get https://hacker-news.firebaseio.com/v0/item/1.json,
         | https://hacker-news.firebaseio.com/v0/item/2.json,
         | https://hacker-news.firebaseio.com/v0/item/3.json, ...
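
A rough Python equivalent of that sequential crawl, as a sketch only: the
author used C++ with libcurl, and this version swaps in the standard
library's urllib. The function names and the injectable opener parameter are
hypothetical; only the item URL pattern comes from the comment above.

```python
import json
import urllib.request

BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """URL of a single item (story, comment, ...) in the HN Firebase API."""
    return f"{BASE_URL}/item/{item_id}.json"


def fetch_item(item_id, opener=urllib.request.urlopen):
    """Download and decode one item; `opener` can be replaced in tests."""
    with opener(item_url(item_id)) as response:
        return json.load(response)


def crawl(first_id, last_id, opener=urllib.request.urlopen):
    """Yield items first_id..last_id in order, skipping ids that decode to null."""
    for item_id in range(first_id, last_id + 1):
        item = fetch_item(item_id, opener)
        if item is not None:
            yield item
```

A real crawl over millions of items would want retries, rate limiting, and
parallel requests, none of which are shown here.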
        
         | ufmace wrote:
         | I don't know that I'd call this evil. We have no idea who else
         | is using this kind of technology but not making the results
         | public. Better to know what's possible and take measures to
         | make it less effective.
        
         | weinzierl wrote:
         | Please don't shoot at the messenger. costco shared this
         | voluntarily and I can see no bad intention.
         | 
         | We should see it as an opportunity to learn how easy it is to
         | associate different pseudonymous accounts. Nothing drives this
         | point home better than a practical demo.
         | 
         | We can be pretty sure stylometry is used widely by bad actors
         | already and we should not punish people who help to spread the
         | word about these technical possibilities.
        
           | ghaff wrote:
           | And this is actually quite a simple approach--which is
           | interesting in and of itself. While there would be
           | diminishing returns, there are a ton of other techniques you
           | could use to make stronger inferences about similarity.
        
         | [deleted]
        
         | vfinn wrote:
         | Imagine using this across different platforms :/, and let alone
         | using different techniques in addition...
         | 
         | edit: maybe you'd catch some criminals if you tried to match
         | reddit against dark web for example
        
         | woodruffw wrote:
         | HN has an Algolia-based API. It's also _very_ easy to crawl.
         | 
         | I wouldn't call this evil, however: it's merely demonstrating a
         | technique that you _should_ be aware of, if you're a privacy-
         | conscious person. It looks like they also provide some
         | resources for avoiding stylometric detection.
        
           | nanidin wrote:
           | I would bet my bottom dollar that the likes of Reddit and
           | Google already have models to turn a corpus of text into
           | probable demographic data and models to measure the
           | similarity of users.
        
         | faeriechangling wrote:
         | It's just statistics. I recall that during his whistleblowing,
         | Snowden intentionally took anti-stylometry measures.
        
       | wizofaus wrote:
       | What match level would you expect to see between two randomly
       | chosen individuals?
        
       | seydor wrote:
       | does it use the _most_ used words or least used?
        
       | [deleted]
        
       | yyt554 wrote:
       | Fun exercise would be to find all accounts that suddenly stopped
       | posting around today and correlate them with new accounts
       | created around today.
       | 
       | All those scared folks who naively think that it's not too late
       | yet. Busted.
        
       | super256 wrote:
       | Ahhh, does anyone remember the hacking crew that leaked
       | ETERNALBLUE and other NSA tools and exploits? The Shadow Brokers.
       | 
       | They were always communicating in some kind of meme-russian, and
       | their texts were funny to read. [1]
       | 
       | I believe their writing mostly defeated this kind of analysis, at
       | the cost of looking like idiots (which was probably the reason no
       | one sent them crypto-dollars to buy that stuff exclusively).
       | 
       | Here's an excerpt:
       | 
       | "Attention government sponsors of cyber warfare and those who
       | profit from it !!!!
       | 
       | How much you pay for enemies cyber weapons? Not malware you find
       | in networks. Both sides, RAT + LP, full state sponsor tool set?
       | We find cyber weapons made by creators of stuxnet, duqu, flame.
       | Kaspersky calls Equation Group. We follow Equation Group traffic.
       | We find Equation Group source range. We hack Equation Group. We
       | find many many Equation Group cyber weapons. You see pictures. We
       | give you some Equation Group files free, you see. This is good
       | proof no? You enjoy!!! You break many things. You find many
       | intrusions. You write many words. But not all, we are auction the
       | best files."
       | 
       | [1]
       | https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...
        
       | lettergram wrote:
       | Did something similar in 2018 (still running locally) which could
       | unmask anyone
       | 
       | https://twitter.com/austingwalters/status/104189476543920128...
       | 
       | Made both Metacortex.me and insideropinion.com
       | 
       | The idea being you don't actually need an active directory. It
       | would drop in, figure out all the users (provided one account was
       | on the AD) and would monitor everyone's skill sets, morale,
       | schedule, etc. Worked super well for what it was / is.
        
       | thr0v_awway wrote:
       | writing from throwaway:
       | 
       | Holy shit, it works really, really well. It found all of my older
       | accounts.
        
       | oliwary wrote:
       | Cool! I wonder if it could be run backwards, to identify the
       | users on hackernews with the most unique voices.
        
       | Wistar wrote:
       | I have only ever had a single account but it returned 19
       | possibles with no confidence above .54 but 11 bolded. My own
       | account was listed at the top with a confidence of .9999.
        
         | Macha wrote:
         | Yeah, I have a bunch of bolded mutuals but none above 0.45. I
         | think I have had one or two alts in the past, but probably they
         | didn't make the 10000 character threshold for inclusion (nor can I
         | remember their names to check if they work in inverse).
        
       | [deleted]
        
       | [deleted]
        
       | karol wrote:
       | Are you going to try it on Twitter?
        
       | srean wrote:
       | I tried it on a few user-ids that I strongly suspected were owned
       | by the same person. My hunches stand corroborated. Not sure who
       | is corroborating whom though, me or the script.
       | 
       | Good job.
        
       | msla wrote:
       | It puts almost all of my old accounts decently near the top, but
       | my original account is almost comically low.
        
       | ALittleLight wrote:
       | Of the top ten accounts listed for my name two of them are me.
        
       | zem wrote:
       | heh, I looked up the top bold hit for my name and they really do
       | sound a bit like me (:
        
       | dibt wrote:
       | This doesn't seem to include text from submissions.
       | 
       | I ran it on Brian Armstrong's temp account from here, and it said
       | it didn't write 10,000 characters:
       | 
       | https://news.ycombinator.com/item?id=3754664
       | 
       | EDIT: Or maybe it's something else because Brian only wrote less
       | than 6k characters. But then why can my account be looked up?
       | 
       | Also, I would guess quoted replies are included, which muddies
       | the analysis. Seems to be a very naive implementation. Much more
       | can be done, but this was probably just a quick project.
        
         | costco wrote:
         | Quoted replies shouldn't be included unless there's a bug on my
         | end. Submission text is not included though I probably should
         | have.
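
One way the quote filtering mentioned above could look, as a sketch under
assumptions: it treats a comment as plain text in which each quoted paragraph
is a single line starting with ">". The function name is hypothetical, and
the site may well do this differently.

```python
def strip_quotes(comment):
    """Drop lines that start with '>' (quoted text) from a plain-text comment."""
    kept = [
        line
        for line in comment.splitlines()
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept).strip()


# Usage: only the commenter's own words remain for analysis.
own_text = strip_quotes(
    "> Anything like this for Reddit?\n\nNo but it would be easily adaptable."
)
```

Quotes written in other styles (italics, quotation marks) would slip through
this filter.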
        
       | anpat wrote:
       | This needs to exclude "Who is hiring?" posts because it confuses
       | me with a few of my wonderful former colleagues!
        
       | [deleted]
        
       | antirez wrote:
       | Writing "antirez" shows accounts with Spanish names (none is
       | mine). I guess Italian and Spanish speakers write English very
       | similarly, but on HN there are a lot more Spanish speakers than
       | Italian ones so that's what I get.
        
       | operator-name wrote:
       | What does the bold signify? For example when I search for dang
       | (https://stylometry.net/user?username=dang) the 4th most likely
       | user is not bold whereas the 16th is?
        
         | costco wrote:
         | Say you see user2 listed in bold on user1's page. That means
         | that user1 is also in user2's top 20 users. In my experience it
         | is often an indicator of a good match (but not always).
        
           | operator-name wrote:
           | Huh, that's a somewhat non intuitive property.
        
             | silasdavis wrote:
             | It is a bit, but if stylometric equality was a thing you'd
             | expect it to be symmetric, so if stylometric similarity is
             | a thing....
        
       | Trouble_007 wrote:
       | Nice work! Thank you, of course I plugged in the obvious HN
       | usernames
       | 
       | Edit to add;
       | 
       | Would be nice to have the
       | https://news.ycombinator.com/user?id=username links included.
        
         | Trouble_007 wrote:
         | And perhaps rounding to 3 or 4 decimal places?
        
       | iHateStylo wrote:
        
       | mysterydip wrote:
       | I was curious to use this on myself to see if anyone writes like
       | me. Closest was a .51 confidence, so I guess not?
        
       | harryvederci wrote:
       | My runner-up has a rating of 0.42378790667730715
       | 
       | C'mon guys, work harder. That's not even close! :-D
       | 
       | Btw, I myself am only at 0.9999999999999999 so I guess I need to
       | work harder at being myself.
        
       | iambateman wrote:
       | This is cool!
       | 
       | If an account returns a high score for many accounts, does that
       | also mean they're relatively less original in style?
        
       | medellin wrote:
       | How much writing do you need to analyze results? Would changing
       | account every X sentences eliminate this?
        
         | costco wrote:
         | Current minimum is 10000 characters. In my own tests accuracy
         | was still pretty good at 3-5000 but I instituted the 10000
         | minimum to reduce false positives. Yes it would, if you read
         | the advice page on avoiding detection that is one of the things
         | I recommend. Unfortunately HN moderators do not really like
         | that.
        
       | rmelhem wrote:
       | nice one. are you using gpt3 under the hood?
        
         | costco wrote:
         | I'm not that smart - my site is basically just doing some
         | calculations on word frequencies. You can read
         | https://academic.oup.com/dsh/article-abstract/17/3/267/92927...
         | and
         | https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...
         | and https://news.ycombinator.com/item?id=33755898 for more
         | information.
        
           | dunham wrote:
           | I am curious whether it could pick GPT3 out of the crowd.
        
           | ghaff wrote:
           | As you mention on the site, you don't do punctuation. But I'm
           | guessing there are some pretty good fingerprints like:
           | 
           | two spaces after a period
           | 
           | Whether someone uses an em-dash/single hyphen/double hyphens
           | (which may correspond to house style they're used to)
           | 
           | Whether they use semi-colons
           | 
           | (Presumably harder) but consistent substitutions like loose
           | for lose, break for brake, etc.
           | 
           | Use of accents
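
A minimal sketch of how the punctuation habits listed above could be turned into countable features. This is not part of the site, which by the author's account ignores punctuation; the function name `punctuation_features` and the feature names are hypothetical, and the word-substitution idea (loose/lose) is not covered here.

```python
import re
import unicodedata


def punctuation_features(text):
    """Count a few punctuation and typography habits in a text."""
    return {
        # A period followed by exactly two spaces and then more text.
        "double_space_after_period": len(re.findall(r"\. {2}(?=\S)", text)),
        # A real em-dash character (U+2014).
        "em_dash": text.count("\u2014"),
        # Exactly two hyphens in a row, typed as a stand-in for a dash.
        "double_hyphen": len(re.findall(r"(?<!-)--(?!-)", text)),
        # A single hyphen with a space on each side.
        "spaced_hyphen": text.count(" - "),
        "semicolon": text.count(";"),
        # Letters whose NFD form differs from the letter itself, which
        # covers accented Latin letters such as e-acute.
        "accented_letter": sum(
            1
            for ch in text
            if ch.isalpha() and unicodedata.normalize("NFD", ch) != ch
        ),
    }


sample = "I left.  Then--after a pause - I came back; the caf\u00e9 was open."
print(punctuation_features(sample))
```

To compare users with different amounts of text, the raw counts would presumably need to be divided by text length before being added to a feature vector.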
        
           | sillysaurusx wrote:
           | Don't sell yourself short. Simplicity is smart. It's
           | astonishing how often the simplest thing turns out to be
           | exponentially more effective than the so-called smart thing.
           | 
           | I can't get over how phenomenal this is. Please put every one
           | of your side project ideas into production!
        
           | isoprophlex wrote:
           | Simplicity is the greatest form of sophistication! Great
           | work!
           | 
           | One small nit from a user experience point of view: it'd be
           | easier on the eyes if you just truncated those cosine
           | similarity scores (or whatever score you're using) after the,
           | say, 5th digit. Showing the entire float is kinda messy to my
           | eyes.
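
A small sketch of the display change suggested above, assuming the scores are ordinary Python floats. The helper name `format_score` is hypothetical. Note that fixed-precision formatting rounds rather than truncates, so a self-match such as 0.9999999999999982 would display as 1.00000.

```python
def format_score(score, digits=5):
    """Format a similarity score with a fixed number of decimal places."""
    return f"{score:.{digits}f}"


for raw in (0.604073065373204, 0.9999999999999982):
    print(raw, "->", format_score(raw))
```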
        
           | Dma54rhs wrote:
           | It's easy to write complicated systems, it takes a genius to
           | make it simple.
        
           | rmelhem wrote:
           | cool and thanks for the clarification. i ask that mainly
           | because of the request limit of openai, which is something
           | that makes many scalable ideas unfeasible
        
       | godisdad wrote:
       | Can we find Satoshi with this?
        
         | thisisnotapipe wrote:
         | Cardano founder Charles Hoskinson believes that only one person
         | fits the profile of the mysterious Bitcoin creator, Satoshi
         | Nakamoto.
         | 
         | In a surprise Ask Me Anything (AMA) session on YouTube,
         | Hoskinson reveals that he has narrowed down his search to one
         | person who he believes is the only individual that fits the
         | part.
         | 
         | "I've been very vocal lately on this. I think that first, it
         | doesn't matter but second, it's probably Adam Back. If you look
         | at the preponderance of the evidence, Occam's razor applies and
         | the most likely answer usually is, and there's no mystique or
         | magic there but he just fits the profile. You're looking for
         | somebody who's in their 40s to 50s who created Bitcoin in 2008.
         | That would fit Adam. English education, grammar, all that
         | stuff, the right computer science background, exactly the right
         | credentials you'd look for. You probably can get pretty far
         | with code stylometry towards validating that."
         | 
         | (from
         | https://tokenpost.com/Charles-Hoskinson-Believes-One-Person-...)
        
         | drpancake wrote:
         | A few people have tried that e.g.
         | https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184...
        
       | crecker wrote:
       | See also:
       | https://serhack.me/articles/unveiling-anonymous-author-stylo...
        
         | costco wrote:
         | That post was actually what motivated me to make this. I'm on
         | your email list :)
        
           | crecker wrote:
           | WOW! It's such a pleasure for me
        
       | neodypsis wrote:
       | How do you protect yourself from impersonators?
        
       | nvr219 wrote:
       | I only got 0.9999999999999992 for myself :(
        
         | noncoml wrote:
         | Naturally Born Imposter
        
       | bumble_bee900 wrote:
       | It's accurate enough that I had to create a new account now :)
       | 
       | I guess it's difficult to evade, as the word frequencies
       | certainly capture the countries I frequently refer to,
       | programming languages, interests etc.
        
       | [deleted]
        
       | [deleted]
        
       | p4bl0 wrote:
       | It's funny that I only match at 0.9999999999999982 with myself
       | while all other usernames I tried matched with themselves at 1.0
       | ^^.
        
         | srean wrote:
         | https://theuijunkie.com/myth-or-fact-did-charlie-chaplin-los...
        
       | [deleted]
        
       | sillysaurusx wrote:
       | Wow. This gives a lot of false positives, but it found all ~10 of
       | my old accounts over the years.
       | 
       | The most interesting thing is that my writing style changed
       | pretty drastically since a decade ago. Searching for my oldest
       | account matches my earliest usernames, whereas searching this
       | account matched the rest.
       | 
       | The details of the algorithm are fascinating:
       | https://stylometry.net/about Mostly because of how simple it is.
       | I assumed it would measure word embeddings against a trained ML
       | model, but nothing so fancy.
        
         | [deleted]
        
         | FormerBandmate wrote:
         | > sillysaurus3
         | 
         | > sillysaurus2
         | 
         | Tbf a human could have found a bunch of them relatively easily
        
         | lettergram wrote:
         | Frankly similar to how I was doing it back in 2018 (when you
         | and I chatted about it on HN lol)
         | 
         | https://news.ycombinator.com/item?id=17944293
         | 
         | The approach I took was a bit different, but also no ML
         | required.
         | 
         | The real trick is pruning and going cross platform. There are
         | around 100k active HN accounts (meaning posts a few times a
         | year), maybe 200k if you count at least one post a year. But
         | <10k that post weekly.
         | 
         | It's a very small space to try to compare so simple methods
         | will work fine.
        
           | costco wrote:
           | Exactly. HN emphasizes long-form posts much more than other
           | forums which makes the commenters here very susceptible to
           | this kind of analysis. Plus you can fit every single HN
           | comment in RAM on a mid tier gaming laptop so it's even
           | easier. I was trying to think of applications of this kind of
           | data and the only thing I could think of was moderation
           | tools/detecting ban evaders but what you've done seems much
           | more profitable lol.
        
         | echelon wrote:
         | It works like a charm for me too.
         | 
         | I put in my username and found my pre-echelon alt,
         | possibilistic.
         | 
         | (Echelon was taken when I registered possibilistic, but it must
         | have been unused and dropped.)
        
         | costco wrote:
         | Yeah top 20 is a little excessive because in my own tests I
         | found that top 20 is only marginally more accurate than top 10.
         | You can get a more academic explanation here:
         | https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...
         | I was amazed too because it seemed too easy!
        
           | sillysaurusx wrote:
           | FWIW, top 20 was necessary for mine. The bolding was a
           | brilliant move. Several of my accounts were ranked 10-20, but
           | popped out due to the bolding.
        
             | justusthane wrote:
             | What does the bolding indicate?
        
               | sillysaurusx wrote:
               | The explanation is here:
               | https://news.ycombinator.com/item?id=33755466
               | 
               | As far as I'm concerned, it's the killer feature of the
               | app. The top 20 results may be noisy, but the bolded
               | results have a signal to noise ratio close to infinity.
        
               | costco wrote:
               | The funny thing is that I thought of it while eating
               | dinner last night :)
        
               | jsnell wrote:
               | The precision of the bolded results looks like maybe 30%
               | to me. Significantly better than the non-bolded, but
               | nowhere near perfect precision.
        
               | costco wrote:
               | False positives become an increasingly difficult problem
               | the more and more potential authors you introduce. If I
               | had written a fancier model it probably wouldn't be as much
               | of a problem but what can you do.
        
               | jsnell wrote:
               | Yes, this wasn't a criticism of the tool. It is crazy
               | good.
               | 
               | But I don't think people should be making the assumption
               | that bolded results are definite alts, which sillysaurus'
               | comment reads like.
        
               | sillysaurusx wrote:
               | Hmm, that wasn't my intent. I see this tool as a
               | recommendation engine more than a doxxer. By "signal to
               | noise ratio close to infinity," I meant that if you visit
               | one of the bolded accounts, they'll probably sound a lot
               | like you.
               | 
               | It's one of those ideas that makes the tool substantially
               | more effective, yet never would've occurred to me. It's
               | like the simplicity of pg's "a plan for spam" algorithm:
               | deceptively simple, but (like scrubbing dishes with
               | fingers) works really well.
        
               | dragonwriter wrote:
               | Of my top 20, 19 are bold, all are above 0.6, and I have
               | no alts.
        
               | loeg wrote:
               | I have 7 bolded names (0.53-0.62) in the top 20 list, and
               | none are alts of mine.
        
               | morsch wrote:
               | I'm one of them and I can confirm. But then again that's
               | what I'd say if I was.
        
               | loeg wrote:
               | Hi style-adjacent friend :-). Just briefly looking at
               | your recent comment history, we seem to find different
               | kinds of articles interesting, but maybe have a similar
               | writing style.
        
               | ghaff wrote:
               | Pretty much the exact same. (I do have a throwaway
               | account but I rarely use it and it probably hasn't been
               | used enough to qualify.)
        
         | User23 wrote:
         | I'd figured it would be some kind of n-gram frequency analysis.
         | Would be interesting to code that up and compare.
        
           | costco wrote:
           | It is. The description on the about page is a little
           | simplified, but basically I look at the most common word and
           | character ngrams of size 1,2,3 (200 each), put all the
           | frequencies in an array and then compare to all the other
           | users with
           | https://scikit-learn.org/stable/modules/generated/sklearn.me....
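
The description above can be sketched roughly as follows. This is a standard-library illustration, not the site's code: the function names, the lowercasing, and the whitespace tokenization are all assumptions, and the comparison step uses a hand-written cosine similarity (the thread elsewhere calls the scores cosine similarity scores) rather than whatever scikit-learn function the truncated link points to.

```python
from collections import Counter
from math import sqrt

# Feature families: word and character n-grams of size 1, 2 and 3.
FAMILIES = [(kind, n) for kind in ("word", "char") for n in (1, 2, 3)]


def gram_counts(text, kind, n):
    """Count the n-grams of one family in a text (lowercased)."""
    text = text.lower()
    seq = text.split() if kind == "word" else list(text)
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def build_vocab(texts, top_k=200):
    """Pick the top_k most common n-grams of each family across all texts."""
    texts = list(texts)
    vocab = []
    for kind, n in FAMILIES:
        totals = Counter()
        for text in texts:
            totals.update(gram_counts(text, kind, n))
        vocab.extend((kind, n, gram) for gram, _ in totals.most_common(top_k))
    return vocab


def vectorize(text, vocab):
    """Relative frequency of every vocabulary n-gram in one user's text."""
    counts = {fam: gram_counts(text, *fam) for fam in FAMILIES}
    totals = {fam: sum(c.values()) or 1 for fam, c in counts.items()}
    return [counts[(kind, n)][gram] / totals[(kind, n)]
            for kind, n, gram in vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical usage: one concatenated string of comments per user.
comments = {
    "user_a": "I think the simplest thing usually works best in practice.",
    "user_b": "I think the simplest thing often works well in practice.",
    "user_c": "lol nope. never. zzz qqq 123",
}
vocab = build_vocab(comments.values())
vectors = {user: vectorize(text, vocab) for user, text in comments.items()}
for other in ("user_b", "user_c"):
    print(other, cosine(vectors["user_a"], vectors[other]))
```

With only a few sentences per user the scores mean little; the 10000-character minimum mentioned earlier in the thread exists for that reason.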
        
             | User23 wrote:
             | Cool, I only skimmed the description maybe I needed to read
             | it more carefully.
             | 
             | Have you considered doing rune rather than word ngrams? I
             | can imagine that might be prohibitively expensive, but I
             | really don't know. I did something like that long long ago
             | in C for automatic document language detection. It was
             | quite accurate.
        
       | throwaway5434q wrote:
       | Wow. This is insane, it found my old accounts. So throwaway
       | obviously (because I'm a bit of an asshole) but this really is
       | amazing. It also highlighted another account that's not me, but
       | looking through their comments I don't see any resemblance to me
       | either.
        
       | stavros wrote:
       | Oh wow, it's really sure that I'm stavrosk, which I am:
       | 
       | https://stylometry.net/user?username=stavros
       | 
       | The next person is 30% less certain, that's huge! This would
       | basically identify any alt I might have with near certainty.
        
         | jvolkman wrote:
         | stavrosk doesn't have any posts/comments? What's it using to
         | match?
        
           | stavros wrote:
           | It's my old username.
        
             | costco wrote:
             | Huh... seems there are some inconsistencies between what's
             | presented on news.ycombinator.com and the Firebase API.
             | Glad it matches for you though :)
        
               | stavros wrote:
               | I guess they just didn't go back and reparse, not a big
               | problem. I don't think people change their username
               | frequently :P
        
         | rogual wrote:
         | Funny thing is, it thinks I'm you, but it doesn't think you're
         | me!
         | 
         | https://stylometry.net/user?username=rogual
         | 
         | I'd have thought this stylometry thing would be commutative.
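
A toy illustration of why this can happen even though the score itself is symmetric: a top-k list depends on who else is near each user. The similarity values and usernames below are made up for the example.

```python
# Made-up symmetric similarity scores between four hypothetical users.
SIM = {
    frozenset(("a", "b")): 0.5,
    frozenset(("a", "c")): 0.2,
    frozenset(("a", "d")): 0.1,
    frozenset(("b", "c")): 0.7,
    frozenset(("b", "d")): 0.6,
    frozenset(("c", "d")): 0.3,
}
USERS = ["a", "b", "c", "d"]


def similarity(u, v):
    """Symmetric by construction: the key ignores argument order."""
    return SIM[frozenset((u, v))]


def top_k(user, users, k):
    """Return the k users most similar to `user`, best first."""
    others = [v for v in users if v != user]
    return sorted(others, key=lambda v: similarity(user, v), reverse=True)[:k]


print("top 2 for a:", top_k("a", USERS, 2))
print("top 2 for b:", top_k("b", USERS, 2))
```

Here "b" is the closest match for "a", but "b" has two closer neighbours of its own, so "a" falls off a top-2 list for "b". The same effect would explain one-directional matches in a top-20 list.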
        
       | ed25519FUUU wrote:
       | Very cool! And really a shame that you're not allowed to delete
       | an old alt account or comments on HN! It follows you forever
       | apparently.
        
       | Arathorn wrote:
       | It found my old account (ara4n; i lost the password) at 0.63.
       | More amusingly it found my cofounder too, who hardly ever posts
       | here (at 0.48)
        
       | thot_experiment wrote:
       | Maybe this is a good tool to find new friends. :P
        
       | pkos98 wrote:
       | Sorry dang, aka sctb: https://stylometry.net/user?username=dang
        
         | Macha wrote:
         | In this particular case, it seems to be picking up the stock
         | moderation responses as it looks like sctb was a moderator
         | account until 2019.
        
       | alpacabag wrote:
        
       | Semaphor wrote:
       | My alt accounts (not really, all below 0.5) seem to also be
       | European or German Firefox users. Good for us ;)
        
       | nr2x wrote:
       | Honeypot to see what accounts are tested in sequence?
       | 
       | ;-)
        
         | costco wrote:
         | I turned off nginx logging if that makes you feel any better.
         | Of course there's no way for you to verify that because I'm
         | just a random guy on the internet but I will tell you that I am
         | a civic minded citizen who is concerned about privacy and the
         | Internet.
        
           | nr2x wrote:
           | Only half kidding, but if I were state intel it's what I'd
           | be doing. :D
        
       | atum47 wrote:
       | at what threshold is it considering alt account?
        
         | costco wrote:
         | There is no threshold. This site does not make any call as to
         | whether a user is an alt or not. It just gives the users with
         | the most similar word choice and from there it is up to you to
         | decide (is there a very specific detail that both accounts
         | mention, do they post at similar times, etc). I will say bolded
         | accounts are substantially more likely to be alts though. But
         | obviously it is not guaranteed that every user has an alt.
        
       | F_r_k wrote:
       | Found my phone account; I'm quite impressed, really!
        
       | ufmace wrote:
       | I wonder what's a reasonable threshold for "probably the same
       | person". I've never had an alt on HN, and when I searched myself,
       | it found 3 other users above 0.6, none of whom I've ever heard of
       | before.
        
         | [deleted]
        
         | costco wrote:
         | If it's >0.9 you can almost guarantee it's an alt but I've
         | seen certain matches at 0.6. The problem is writing styles
         | change over time. Another idea I had was converting the scores
         | which are just cosine similarity scores into percentiles (so
         | 0.99 would be 99th percentile of certainty) to make them more
         | human interpretable.
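
One way the percentile idea above could look, as a hedged sketch: rank a score against a reference sample of scores between pairs of accounts. The function name `percentile_rank` and the reference values are hypothetical; the thread does not say how the site would build such a sample.

```python
from bisect import bisect_left


def percentile_rank(score, reference_scores):
    """Percentage of reference scores that are strictly below `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical reference sample of pairwise similarity scores.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
print(percentile_rank(0.65, reference))
```

A percentile only says how unusual a score is within the reference sample; it is not by itself a probability that two accounts share an author.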
        
           | forgotpwd16 wrote:
           | >The problem is writing styles change over time.
           | 
           | Will be interesting if we could plot the writing style
           | divergence over time.
        
           | throwdbaaway wrote:
           | I got matched with my old account with a score of only 0.45
        
           | bonzini wrote:
           | The people at 0.4-0.6 with me do share some interests. That's
           | cool on its own.
        
           | throwup wrote:
           | I make new accounts every so often and the accounts of mine
           | that it found have a score of around 0.3. I'm not actively
           | trying to defeat stylometry but it's possible I just have a
           | particularly unremarkable writing style.
        
             | xwolfi wrote:
             | Well I must be stereotypical myself because it found me at
             | 0.8 !
        
         | MBCook wrote:
         | I have no alts. The highest match for me is about 0.66.
        
         | dotancohen wrote:
         | Interesting. The highest non-me account is under 0.4 on my
         | page. I do not believe that I have such a unique writing style
         | - especially since half my posting is on mobile and therefore
         | possibly slightly different than my desktop posts.
        
           | dwringer wrote:
           | My closest is 0.4879. I know I tend to be wordy but I thought
           | I had a pretty generic style as well. This is definitely a
           | fascinating demonstration.
        
             | drdec wrote:
             | Feeling better about my high of 0.49 now
        
         | pyb wrote:
         | 0.6 is not high enough to indicate an alt
        
       | Yeahsureok wrote:
       | On the how to avoid section: Isn't running comments through a
       | randomised translator a few times then back considered a
       | countermeasure also?
       | 
       | Also think it's probably poor form to list users as examples
       | without their permission.
        
         | costco wrote:
         | > On the how to avoid section: Isn't running comments through a
         | randomised translator a few times then back considered a
         | countermeasure also?
         | 
         | Yes.
         | 
         | > This may be out of line but isn't pg on here with a different
         | username, Levenshtein distance of one that's not included? Or
         | is that just a very motivated 13yo account who writes a lot of
         | admin-esque comments.
         | 
         | What other pg account are you referring to? I want to see it so
         | I can see what my algorithm missed.
         | 
         | > Also think it's probably poor form to list users as examples
         | without their permission.
         | 
         | You're right. I'll remove that - I just wanted some examples
         | especially for people on phones who don't feel like typing.
         | Thanks for the feedback.
        
         | jacooper wrote:
         | > However, using automated methods like machine translation
         | services do not appear to be a viable method of circumvention.
         | 
         | https://www.whonix.org/wiki/Stylometry
        
       | Lichtso wrote:
       | I wonder how much this can be improved if metadata is taken into
       | account as well. Especially the distribution of common post dates
       | and times modulo a week, which also exposes in which timezone
       | somebody probably lives.
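
A rough sketch of the "post times modulo a week" idea above, assuming each comment comes with a Unix timestamp. The function name and the 168-bin layout (weekday times 24 plus hour, in UTC) are choices made for this illustration, not something the site is known to do.

```python
from datetime import datetime, timezone


def hour_of_week_histogram(unix_timestamps):
    """Normalized 168-bin histogram; bin = weekday * 24 + hour, in UTC.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    hist = [0] * 168
    for ts in unix_timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        hist[dt.weekday() * 24 + dt.hour] += 1
    total = sum(hist) or 1
    return [count / total for count in hist]


# Hypothetical timestamps: two posts in the same weekly slot, one an hour later.
stamps = [0, 7 * 24 * 3600, 3600]
hist = hour_of_week_histogram(stamps)
print([(i, round(p, 3)) for i, p in enumerate(hist) if p > 0])
```

Two accounts' histograms could then be compared with the same kind of similarity score used for word frequencies, and a long run of empty hours hints at when the author sleeps.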
        
       | 2OEH8eoCRo0 wrote:
       | This found an old account that I forgot I even had but with a lot
       | of false positives. Neat!
        
       | bscphil wrote:
       | The scary thing is that once you have this data, finding HN
       | matches for individual targeted users on other sites becomes
       | trivial, even if those sites are harder to scrape. I bet most
       | people here have an anonymous Reddit account, for example. If you
       | wanted to know who was behind a particular Reddit account, you
       | could feed it into something like this and compare the results
       | with HN, where accounts are less likely to be anonymous. Or build
       | a database based on blogs, Github comments, etc.
       | 
       | Also, since this uses only word frequency, there are probably
       | relatively easy improvements to make that would make it even more
       | powerful, like looking at particular runs of words that are
       | unique. Some expressions or figurative language only show up in
       | combinations of words, and tend to be highly style specific.
        
         | faeriechangling wrote:
         | Thus proving the only actually anonymous community in practice
         | is 4chan, and that's why it's so toxic.
        
           | sbierwagen wrote:
           | If you define "toxic" as "people disagreeing with you", sure.
           | That was what the entire internet was like until maybe 2005.
        
             | philosopher1234 wrote:
             | "People disagreeing with you" describes almost none of the
             | conversation on 4chan
        
             | ben_w wrote:
             | I'm old enough to remember when 4chan was _self
             | identifying_ as the Internet's hate machine, before xkcd
             | referenced it as such: https://xkcd.com/591/
             | 
             | Sometimes people insist that's all role-play and irony;
             | others insist that if it ever was, it certainly isn't now.
             | 
             | But regardless, I remember pre-2005, and it wasn't all like
             | what I saw the two times I looked at 4chan. Bits were. Bits
             | were _much worse_. But mostly, _mostly_, people were
             | kinder... at least, unless political tribalism came up.
        
         | costco wrote:
         | I could have used a part of speech tagger, looked at time of
         | day a user posts, capitalization, spelling errors, etc. From
         | what I understand the state of the art is lightyears ahead of
         | this, there are even companies with actual linguists who will
         | act as expert witnesses in court to say stuff like "we can say
         | with 95% certainty that xyz authored this email." Honestly it's
         | kind of scary. There are papers that talk about cross platform
         | authorship attribution, one I think did it with Twitter,
         | Blogspot, G+ and had pretty good results.
        
       | saurik wrote:
       | It would be convenient if the usernames linked to the comment
       | pages on Hacker News (to avoid having to copy/paste and URL hack,
       | which is made even slightly more annoying because for some reason
       | when I tap and hold the usernames to copy them your markup--I
       | haven't looked at why yet--is causing an extra space character to
       | get copied on the left).
        
       | honkler wrote:
       | Not today.
       | 
       | You fail, I win.
        
         | costco wrote:
         | Nice. Just out of curiosity are you taking any countermeasures
         | or varying your writing style across accounts in any way?
        
           | psychphysic wrote:
           | My second closest match was 0.35 but searching people where
           | they have matches 0.5-0.75 I suspect that's mostly to do with
           | number of posts leading to better statistics.
        
       | soneca wrote:
       | I have two accounts. This one, "soneca", that is my first one and
       | most active by far, and another one that I use sometimes mostly
       | for Show HN and few comments.
       | 
       | When I searched the other one, "soneca" was the first guess, with
       | 0.4.
       | 
       | But when I searched "soneca", the other one was not in the top
       | 20.
        
       | 00F_ wrote:
       | ive had maybe a hundred throwaway accounts on HN over the past
       | ten years. generally, i make an account, say something that is
       | apparently wildly offensive to someone else, get flagged and
       | down-voted and then muted or hell-banned. then i make another
       | account because i never did anything wrong and start the process
       | over again. ive emailed the admins, tried to reason with the
       | admins, it never does any good. the power is held by power-users
       | who flag people -- most of the power of an admin at the end of
       | the day but without any of the accountability. as long as they
       | are following the mainstream dogma, its all good.
       | 
       | anyway, this app was able to identify a lot of my accounts. but a
       | lot of the matches werent me. bold matches were almost all me.
       | but i know there are many more matches than those that were
       | listed. it mainly showed my most recent accounts.
       | 
       | i think most people would get a sick feeling in their stomach if
       | they tried this app. i dont think people are prepared for a world
       | where you can type someones name into an app like this and
       | produce everything ever recorded online that was created by that
       | person. not only this but everything highlighted and summarized
       | to answer any question about that person. this is what advanced
       | ai will bring us. an information implosion where the planet-sized
       | ocean of data that is just floating all around us suddenly and
       | violently coalesces into the objects of our new societal
       | calculus. violent is a good word. and this is just the change
       | that one can see coming with ai.
        
         | costco wrote:
         | You are definitely right. Part of the reason I chose the 10,000
         | character minimum was so that people using throwaways in the
         | true sense would be entirely excluded. I don't plan on keeping
         | this up forever and I too would not feel comfortable if this
         | was deployed at scale.
        
           | ayewo wrote:
           | Would you be open to open sourcing the code when you decide
           | to shut down the service?
        
         | stupendous_luck wrote:
         | You really don't need advanced AI to do it. Just a bunch of
         | scrapers and some run of the mill statistics. And guess what,
         | it's been done by many companies already. They just don't care
         | to create such a site.
        
       | moneywoes wrote:
       | What algorithm is being used?
        
         | interroboink wrote:
         | It's described here: https://stylometry.net/about
        
       | rglover wrote:
       | It's moments like this I'm proud to have my insanity on full
       | display without obscurity. Was surprised to see a bunch of ~30%
       | matches despite not having any alts.
        
       | kfichter wrote:
       | Does anyone here have a reasonably wide variety of similarity
       | ratings? I'd love to see the difference between a 0.2 and a 0.8
       | for the same account.
        
       | peacelilly wrote:
       | This is creepy.
        
         | noncoml wrote:
         | I think the word you are looking for is uncanny
        
       | jallasprit wrote:
       | Most likely candidates:
       | 
       |     pg: 1.0
       |     montrose: 0.604073065373204
       |     mattmaroon: 0.5900372458160795
       |     natsu: 0.5519832271289953
       |     rauljara: 0.5418566694533273
       |     waterlesscloud: 0.5378996309342633
       |     damoncali: 0.5292014150349463
       |     gruseom: 0.5290151637991445
       |     kemiller2002: 0.5254174524920762
       |     jfengel: 0.5231938496089998
       |     jamesaguilar: 0.5229081613163672
       |     houseabsolute: 0.5219738531025365
       |     danssig: 0.5195368367601849
       |     austenallred: 0.519343009683366
       |     loewenskind: 0.5177030083877397
       |     baguasquirrel: 0.5153841099708854
       |     asdfasgasdgasdg: 0.5146704002447524
       |     aptwebapps: 0.5144149629369845
       |     allenbrunson: 0.512802806408646
       |     danielweber: 0.5123620795710832
        
         | [deleted]
        
       | andsoitis wrote:
       | we leave fingerprints everywhere
        
       | throwawayhghcj wrote:
       | I'd like to request the author takes this offline please until
       | the implications can be thought through.
       | 
       | This is breaking anonymity that people incorrectly thought would
       | not be revealed.
       | 
       | For some it might be awkward, others it might be quite
       | problematic.
        
         | s3000 wrote:
         | This is nothing new, e.g:
         | 
         | Analyzing stylistic similarity amongst authors
         | 
         | https://news.ycombinator.com/item?id=10050603
         | 
         | http://markallenthornton.com/blog/stylistic-similarity/
         | 
         | 37 points by lingben on Aug 12, 2015
        
         | kaba0 wrote:
         | I would agree with you but the genie is out of the bottle
         | already. Nigh everyone can and could have reproduced these
         | results, especially that archive.org and similar things exist.
         | 
         | So, I don't think it causes any new harm, if anything it gives
         | you future risk aversion.
        
       | silasdavis wrote:
       | The top hit for me, though not a very high correlation (0.3 ish),
       | is to my surprise someone I have met. I don't appear on their top
       | 20 though.
        
       | musicale wrote:
       | > I made this site mostly to show how easy this is and how it can
       | erode online privacy
       | 
       | looks like it can indeed
       | 
       | > Here are some frequent HN commenters: (EDIT: Removed due to
       | privacy concerns)
       | 
       | How surprising that someone might object to being included in a
       | demonstration of the erosion of privacy!
       | 
       | Is the site opt-in or opt-out?
        
         | Aachen wrote:
         | I doubt they asked 78k users for permission when there's no
         | standardized way of reaching out if you're not a site admin.
         | It's opt out if anything.
        
           | bee_rider wrote:
           | You opt into making your writing publicly available when
           | making posts on this site. I'm not sure what Ycombinator's
           | user agreement* says about this, but it is pretty obvious
           | that they haven't done anything to prevent it (and it isn't
           | clear what they could do).
           | 
           | * and I mean the author of the tool is here making posts, so
           | I guess they have agreed to the TOS, but clearly someone who
           | hasn't agreed to it could also make this tool and scrape out
           | publicly available posts without agreeing to anything.
        
       | StrangeDoctor wrote:
       | Have you done any data analysis on distributions of similarity?
       | How similar you'd expect any 2 people to be given English focused
       | around tech? Or any other interesting stats you'd like to share?
       | 
       | Very nice clean site, great work.
        
       | Ros2 wrote:
       | I interviewed years ago with someone who let me know that they
       | use a pseudonym as an employee and their chosen name even got
       | posted as the author for articles they wrote for the company.
       | They were very concerned about their privacy.
       | 
       | I know their blog, which is their HN username, and this tool
       | found their other account.
       | 
       | Perhaps ironically, this person stood out a lot because of this
       | and I didn't forget them.
        
       | zxcvbn4038 wrote:
       | How long until this becomes the algorithm for a dating site?
       | 
       | "Find hot single women who write just like you"
        
         | nrp wrote:
         | This seems like a great way to hire freelance copywriters/ghost
         | writers too. I would absolutely hire someone I knew could match
         | my tone well for writing generic unattributed copy.
        
         | forgotpwd16 wrote:
         | Wouldn't be surprised if dating sites already used similar
         | algorithms.
        
           | bornfreddy wrote:
           | Wouldn't be surprised if most of the women on a specific
           | dating site had very high similarity scores.
        
       | davebillyhock wrote:
       | This found an alt that I created specifically to see if I could
       | write artificially to defeat this kind of analysis. I have seen
       | other tools like it posted to HN, but none before had found that
       | account. I guess I need to up my game.
        
         | [deleted]
        
         | CharlesW wrote:
         | If you don't mind sharing, are you "writing artificially"
         | purely in your head, or are you using techniques like
         | intermediate translations?
        
           | davebillyhock wrote:
           | No mechanical means, but I have referred to a thesaurus
           | occasionally. Mostly I tried to change my sentence structure,
           | not just words. It requires actually thinking differently, in
           | a way. Which makes it difficult to know how well I'm
           | communicating.
        
             | crtified wrote:
             | I imagine this would be quite difficult in practise, due to
             | all the subliminal factors behind a person's writing
             | choices.
             | 
             | For example, as somewhat illustrated here, your personal
             | vocabulary is a kind of fingerprint. As you mention, using
             | a thesaurus can somewhat alleviate that, but if a thesaurus
             | is only changing a small % of your words, then it will only
             | have a suitably small % effect upon analysis.
             | 
             | To go yet further might (I suspect!) entail methods such as
             | directly lifting and using other people's sentences to
             | convey your own thoughts. But even then, "your own thought
             | patterns" are still informing the manner of the post, to
             | some extent, so over time increasingly robust analysis may
             | still find patterns to hook into.
        
               | neodypsis wrote:
               | I wonder if someone will come up with a Grammarly-like
               | tool which you can feed with sample writings to help you
               | increase/lower the similarity score of a new text you are
               | writing.
        
       | ruined wrote:
       | didn't find a single one of my alts. nice
        
         | costco wrote:
         | I obviously don't expect you to help me but do they have at
         | least >10000 characters written and are you varying your
         | writing style in any way?
        
       | paulpauper wrote:
       | Inserting random Unicode blank, 1/4, 1/2, or zero space
       | characters into your writing may help thwart it too, if you are
       | paranoid
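       | 
       | A rough sketch of why this may not help much, assuming the
       | tool cleans up text before comparing it (the site's actual
       | preprocessing is not described in the thread). Unicode
       | normalization plus dropping format characters undoes the
       | trick; normalize() is a made-up helper name.

```python
import unicodedata


def normalize(text):
    """Fold exotic spaces to plain spaces and drop invisible characters."""
    # NFKC maps compatibility characters such as U+2005 (four-per-em
    # space) to an ordinary space.
    text = unicodedata.normalize("NFKC", text)
    # Zero-width characters such as U+200B are in category "Cf" (format).
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Hypothetical input: a zero-width space and a four-per-em space inserted.
disguised = "hel\u200blo\u2005world"
cleaned = normalize(disguised)
```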
        
         | UncleEntity wrote:
         | Huh, that's how I signal my KGB handler...
        
       | lifeisstillgood wrote:
       | How much should we fear de-anonymisation ?
       | 
       | A lot of the discussion in this thread is over "how can we prevent
       | this". I would like to know why should we not embrace this and
       | similar technologies?
       | 
       | The benefits in my view are large - online behaviour tracks back
       | to real life - and epidemiologically speaking, millions of test
       | subjects across every question are invaluable - from traditional
       | medicine to "mass psychology recommendations"
       | 
       | I can guess some downsides (hiding from abusive exes) but am
       | interested in studies, surveys, reports etc - any HN thoughts
       | welcome
        
         | headhasthoughts wrote:
         | What could possibly be the harm in allowing people to harass
         | others based on posts they made decades ago? What could
         | possibly be harmful in making a person who for whatever reason
         | has changed their online identity easier to track? What could
         | be remotely harmful about allowing Marlboro to find the
         | accounts of ex-smokers? What could be the harm in tracking
         | underaged users site by site?
         | 
         | I'm sure this is completely harmless and will not harm society.
        
         | rejectfinite wrote:
         | >online behaviour tracks back to real life
         | 
         | This is good to you?
         | 
         | Okay, let's just make it like China or SK where your login is
         | your citizen ID and if you write bad things the bad word police
         | will take you away.
         | 
         | Also, no, I have no alts.
        
           | lifeisstillgood wrote:
           | So I am asking because my views are only challenged inside my
           | own head, hence the need for external thoughts.
           | 
           | But firstly the "governments will come and do bad things"
           | argument - yes this is clearly and obviously a major problem
           | - but not one solvable by technology in any way. Fixing
           | violent dictatorships is an IRL problem - one that requires
           | enormous effort and sacrifices (see Ukraine for obvious
           | example). We cannot pretend that a browser extension or a
           | ground up rewrite of Twitter will defeat Putin or would have
           | stopped Hitler.
           | 
           | As for "free" countries (something like 120+ have open free
           | elections), we still have online abuse for voicing opinions
           | that some people don't like (anything from pro/anti Trump to
           | LGBT and bitcoin etc). Those are real consequences but rarely
           | government inspired and honestly I suspect we need better
           | support for police in prosecuting such things - I mean a
           | death threat is a death threat.
           | 
           | In general my view seems to be we should have the same
           | protections online as we do offline - and if those
           | protections are "in theory only" that requires us to use our
           | voting and other political power to change it - not to
           | obfuscate IP addresses or so on.
           | 
           | The upside of tech is so great it is worth spending IRL to
           | defend against the downsides
        
             | rejectfinite wrote:
             | I am of the generation and mindset that online abuse is not
             | real. Straight up. Log out, turn off the screen and watch
             | Netflix, take a walk and calm down, block the offending
             | user. It's not real.
             | 
             | >I suspect we need better support for police in prosecuting
             | such things
             | 
             | We do see that! But mostly people on Facebook. Here we have
             | had judgements of people who posted threats on Facebook
             | because it is tied to your real name.
             | 
             | And yes, abuse is part of the "fun". Under your system, my
             | 10 years old League and CoD chats would have me locked up.
             | 
             | >I mean a death threat is a death threat.
             | 
             | Is it? I would find it more concerning if someone on the
             | street tells me he is going to kill me than a kid on xbox
             | live.
             | 
             | NOW there is a difference in systematic stalking and
             | harassment online if I would get bombarded with DMs and
             | messages to kys. I don't know how to solve. But a one-off
             | comment is NOT equivalent. Then it feels like I'm just old?
             | At 31? Is it really so serious?
        
         | femto113 wrote:
         | Fear it happening or fear its consequences? Doxxing already
         | happens all the time, but the main tools are things like
         | account names or image search, this sort of tool could take it
         | to a new level. A simple experiment would be to run this same
         | algorithm against another site (say Twitter or Reddit) and see
         | if it can reliably pick out the same people's accounts there.
         | Once anyone on the internet can quickly/easily draw that sort
         | of connection it would require incredible diligence to avoid
         | de-anonymization while still maintaining any sort of "real
         | self" presence on the internet. How much we should fear the
         | consequences probably depends a lot on how marginalized you are
         | within your society, but since just revealing your gender is
         | enough to invite harassment in many forums I'm not optimistic.
        
       | CrypticShift wrote:
       | Ingenious idea. At the very least, this is just about finding
       | people who write like us, the same way we seek those with similar
       | tastes (music...)
       | 
       | How long before large commercial indexers start offering an
       | efficient (AI based ?) stylometry to agencies and states ?
       | 
       | wait... do you think the NSA is already doing this?
        
         | A4ET8a8uTh0 wrote:
         | They would be silly not to ( apart from creepish profiling of
         | an entire globe population you also get to potentially identify
         | bots ). We all have mannerisms that can easily 'betray us'
         | online. I honestly thought my writing style is more unique, but
         | as it turns out it is somewhat common.
        
           | CrypticShift wrote:
           | > I honestly thought my writing style is more unique
           | 
           | You just showed another possible use case for this kind of
           | tools: "How unique is my writing style ?"
        
           | sitkack wrote:
           | It isn't writing style, but more of phrase selection. If you
           | lean on the same phrases (n-grams), then you will be very
           | very close in a high dimensional space. Colloquialisms are
           | the biggest tell, you should eschew them.
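           | 
           | A minimal sketch of the n-gram idea, assuming word bigram
           | counts compared by cosine similarity. The thread does not
           | say what features or distance the site actually uses, and
           | the sample texts below are invented.

```python
import math
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count word n-grams in lower-cased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, invented for illustration.
alice = "at the end of the day the cat is out of the bag"
alt = "the cat is out of the bag at the end of the day"
other = "compilers translate source code into machine instructions"

sim_alt = cosine_similarity(ngram_counts(alice), ngram_counts(alt))
sim_other = cosine_similarity(ngram_counts(alice), ngram_counts(other))
```

           | Texts that reuse the same stock phrases share many bigrams
           | and score close together; texts with no phrase in common
           | score zero.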
        
         | woodruffw wrote:
         | Stylometry is an old hat technique; you can assume that
         | intelligence services around the globe regularly apply it.
         | 
         | (Statistical stylometry is a little newer and more rigorous
         | than manual stylometry, which essentially involved a human
         | being's judgement call around the similarity of documents.)
        
           | CrypticShift wrote:
           | What about "deep learning" stylometry ?
        
             | woodruffw wrote:
             | I don't know, but it wouldn't surprise me if someone has
             | tried to apply ML to stylometry. Statistical stylometry is
             | already pretty effective, as demonstrated by this site.
        
       | weinzierl wrote:
       | I played a little bit with it and it is baffling how well it
       | finds accounts of people that know each other in real life. So
       | it's not only good for finding alternate accounts but could be
       | used to find peer groups.
        
       | [deleted]
        
       | the_cat_kittles wrote:
       | pretty cool- i think there should be a term for two accounts that
       | have each other as the top most similar account. kinda sad i dont
       | have one :(
        
         | layer8 wrote:
         | Stylotwins?
        
         | philosopher1234 wrote:
         | We're pretty close me and you -- closer than my actual alts
        
           | the_cat_kittles wrote:
           | hello friend! but... id never use an m dash
        
             | philosopher1234 wrote:
             | Well... I would never use a lowercase word after an
             | exclamation point!
             | 
             | ...Because I'm on mobile
        
       | balls187 wrote:
       | No alt, and the highest match is 0.36
       | 
       | And that account's last several comments were flagged as dead.
       | 
       | I'm a native speaker, but my english succcccks.
        
       | rcarr wrote:
       | One way to get around this legitimately would be by posting a lot
       | of quotes/lyrics/excerpts and the like thus fooling the algorithm
       | unless it had a way to filter them out
        
       | Fnoord wrote:
       | Cool stuff, thank you for sharing your findings!
       | 
       | I don't do throwaway. I either post or STFU. I also STFU on
       | darknet. It's why I found it fun to read/lurk on things like I2P
       | back when it was new. And I know that on a pseudonymous account
       | it is only a matter of time until it can be linked to another
       | pseudonymous account. It would not surprise me if stylometry was
       | used on Dread Pirate Roberts or the people behind The Pirate Bay
       | or the people behind Wikileaks (Assange's sockpuppet accounts).
       | Such can also have been used to verify afterwards instead of
       | beforehand. Though with TPB since it was on clearweb an advanced
       | adversary could have used correlation/timing attack to figure who
       | wrote what.
       | 
       | I'm having fun times recognizing other Dutch people through their
       | usage of English language. For example, a distinctive word I see
       | Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. It's a
       | red flag the person is native Dutch. I wonder if there are
       | stylometry tools available for figuring if someone used physical
       | vs touchscreen keyboard (I used Glider to write this post,
       | spellchecker unavailable).
       | 
       | And yes, organizations like secret service and police should use
       | such tools as well. It is a known tool, why not use it for good?
       | As with any tool, it can be used for good and evil. On HN this
       | could be useful for the mod team (AFAIK nowadays only dang) to
       | find banned people's sockpuppets. Cross-community could also be a
       | fun project: find a HN user's Twitter or Reddit account. And I
       | hope this method is also used to find Russian trolls on social
       | media.
        
         | ghaff wrote:
         | Most people greatly underestimate the power of linkage attacks
         | on anonymity. And it doesn't even take fancy ML. In the context
         | of healthcare records, I like to trot out this 25 year old
         | example of an MIT grad student and the then-governor of MA.
         | 
         | https://ischoolonline.berkeley.edu/blog/anonymous-data/
        
       | dvh wrote:
       | Make a fundraiser and start doing it for other sites.
        
         | costco wrote:
         | It would be possible for Reddit because Pushshift.io archives
         | all the comments there and Reddit is still pretty small. I'd
         | probably need to make things a lot faster. Doing it on a
         | specific subreddit would be very feasible. I'll think about it
         | but I don't actually know if I really want to do that because
         | for instance I've been banned from subreddits before but I
         | don't want a ban from when I was 12 years old to follow me
         | around forever because my writing style hasn't changed.
         | Moderation is the most obvious application of this kind of
         | software.
        
           | rand_user_100 wrote:
           | > I'll think about it but I don't actually know if I really
           | want to do that because for instance I've been banned from
           | subreddits before but I don't want a ban from when I was 12
           | years old to follow me around forever
           | 
           | Insightful that your personal experience and impact on you
           | personally affects your decision. I invite you to think about
           | the impact of the products you build in your CS career by
           | putting yourself in the shoes of other people as well.
           | 
           | Some products should not be built, even though it's easy to
           | build them.
        
       | DenisM wrote:
       | How about this for countermeasure:
       | 
       | As you're typing out a comment the software gives you a list of
       | accounts you're becoming similar to. That way you can adjust your
       | writing as you type.
        
         | bornfreddy wrote:
         | Sounds great, except there are many different similarity
         | measures. Which one does the algorithm use?
        
           | wizzwizz4 wrote:
           | Why not all of them? Which metrics are closer would tell you
           | which aspects of your writing you need to focus on.
        
         | kaba0 wrote:
         | Someone linked it in the thread:
         | https://github.com/psal/anonymouth
        
         | pessimizer wrote:
         | Forget countermeasures, go covert. Write a comment, have the
         | comment be rewritten before submission in order to resemble a
         | targeted account.
        
       | Bhurn00985 wrote:
       | Just a heads up that for everyone who doesn't like to link their
       | alt accounts, maybe not use this tool to see if it works.
       | 
       | Unless the author would run this against all HN user accounts, no
       | need to flag the ones "of interest".
        
       | jl6 wrote:
       | Rebrand it as a soulmate-finder?
        
       | tomrod wrote:
       | Is it weird that my rating is very low compared to alternative
       | options? I have no alts, but I'm curious how similar others might
       | write to me.
        
       | JKCalhoun wrote:
       | The asymmetry is interesting. I have no alts but of course it
       | nonetheless reported accounts similar to mine.
       | 
       | Running then the most similar person to my account did not put me
       | in _their_ top 20.
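       | 
       | A toy sketch of how this can happen, assuming each account's
       | list is simply its k nearest neighbours. The positions below
       | are invented one-dimensional stand-ins for style vectors. The
       | distance between two accounts is the same in both directions,
       | but the rankings need not be.

```python
def top_k(name, points, k):
    """Return the k other names closest to `name`, nearest first."""
    others = [n for n in points if n != name]
    others.sort(key=lambda n: abs(points[n] - points[name]))
    return others[:k]


# Hypothetical one-dimensional "style" positions, invented for illustration.
points = {"a": 0.0, "b": 1.0, "c": 1.5, "d": 1.9}

a_matches = top_k("a", points, k=2)  # "b" is the closest account to "a"
b_matches = top_k("b", points, k=2)  # but "b" has two closer neighbours
```

       | An account on the edge of a cluster sees the cluster as its
       | nearest neighbours, while the accounts inside it are closer
       | to each other than to the outsider.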
        
         | sitkack wrote:
         | I believe this is the
         | https://en.wikipedia.org/wiki/Friendship_paradox
        
       | throwboi123 wrote:
       | That's why I always use throwaway :) everywhere. Reddit. HN.
       | Twitter. Everywhere. I'll spam every site with my throwaways.
       | 
       | Long live throwaways.
        
         | kaba0 wrote:
         | That's the point of this post, that you are not safe by
         | throwaways at all, because all of your throwaways can be linked
         | together purely by your textual style.
        
       | silasdavis wrote:
       | > imagine what a company with millions of dollars and a couple
       | dozen PhD linguists could do.
       | 
       | Could they do much better?
        
       | aaron695 wrote:
        
       | [deleted]
        
       | spaniard89277 wrote:
       | I changed my nickname so my employer can't find me here. I'm not
       | amused by this.
        
         | bee_rider wrote:
         | If this basic implementation can catch you, I'd consider it a
         | friendly reminder that changing your account name is not a very
         | effective means of adding privacy.
        
         | googlryas wrote:
         | New account, then translate your comments to Spanish and then
         | back to English using Google translate.
        
       | aryc19 wrote:
       | So what are some good tools to obfuscate style?
        
       | setr wrote:
       | Forget the alternate accounts -- if two users are close in style,
       | there's a decent chance they should be friends. This is an HN
       | friendship machine.
        
       | AviationAtom wrote:
       | Now I can find my HN doppelganger
        
       | sillysaurusx wrote:
       | Ha, gruseom shows up for pg, which is dang's old account. A
       | worthy successor.
       | 
       | This is a fascinating way to find similar HN users who aren't the
       | same person. It's a surprisingly great recommendation engine. "If
       | you like pg, you might also like..."
       | 
       | Sure, the privacy concerns are valid, but the cat's out of the
       | boot. Might as well enjoy the benefits.
       | 
       | montrose is almost definitely pg. Someone who talks about ancient
       | history, Occam's razor, VCs and startups, uses the phrase "YC
       | cos" (relatively uncommon), etc.
       | https://news.ycombinator.com/item?id=17112567
       | 
       | Nicely done. One of the best hacks I've seen in a long time.
        
         | costco wrote:
         | > montrose is almost definitely pg. Someone who talks about
         | ancient history, Occam's razor, VCs and startups, uses the
         | phrase "YC cos" (relatively uncommon), etc.
         | https://news.ycombinator.com/item?id=17112567
         | 
         | I had this hunch too. It's either pg or someone trying really
         | hard to be pg.
        
           | roughly wrote:
           | I mean, this is HN -
           | 
           | > someone trying really hard to be pg
           | 
           | describes half the site.
        
         | asveikau wrote:
         | > Someone who talks about ancient history, Occam's razor, VCs
         | and startups,
         | 
         | I think these are all common topics among HN readers and
         | commenters.
        
         | pyb wrote:
         | Why would montrose be pg ? The correlation is not that high.
         | Looks like a few people have picked up pg's mannerisms.
        
         | VyseofArcadia wrote:
         | > but the cat's out of the boot
         | 
         | It's my first time hearing that variant. Usually it's "the
         | cat's out of the bag" where I'm from.
         | 
         | Do you mean boot in the UK sense, what Americans would call the
         | trunk of a car? Or do you mean a sturdy piece of footwear?
         | 
         | Obligatory xkcd https://xkcd.com/2390/
        
           | sillysaurusx wrote:
           | It's a little writing trick I learned from (I think) Orwell.
           | Any time you're about to use a common metaphor, try to tweak
           | it. You'll catch readers off guard, which piques their
           | curiosity.
           | 
           | It's a fun game, too. I wish I'd used "the cat's out of the
           | hat," but I didn't think of it till later.
        
             | UncleEntity wrote:
             | Yeah, it's like shooting ducks in a barrel it works so
             | well.
             | 
             | Easy to overuse then people just get annoyed though...kind
             | of like commas, I suppose.
        
             | esfandia wrote:
             | I like mixing metaphors, in this case "the cat's out of the
             | tube". ("the toothpaste's out of the bag" doesn't work as
             | well though)
        
             | InGoodFaith wrote:
             | What you are describing is also known as an eggcorn.
             | 
             | https://en.wikipedia.org/wiki/Eggcorn
        
               | operator-name wrote:
               | That's neeto!
               | 
               | The 2nd example also loosely falls under the
               | classification of malaphor.
               | 
               | https://en.m.wiktionary.org/wiki/malaphor
        
               | sillysaurusx wrote:
               | Thank you! I was trying to find the original essay I
               | learned it from. I'm now pretty sure it was by Poe, but
               | all I can remember is the main advice: avoid common
               | metaphors.
               | 
               | I vaguely remember one of the metaphors in the essay was
               | about a chicken coop melting, or something like that. It
               | was vivid enough to leave a big impression.
        
               | ewilden wrote:
               | I remember this being from Politics and the English
               | Language (https://www.orwellfoundation.com/the-orwell-
               | foundation/orwel...):
               | 
               | " Dying metaphors. A newly invented metaphor assists
               | thought by evoking a visual image, while on the other
               | hand a metaphor which is technically 'dead' (e. g. iron
               | resolution) has in effect reverted to being an ordinary
               | word and can generally be used without loss of vividness.
               | But in between these two classes there is a huge dump of
               | worn-out metaphors which have lost all evocative power
               | and are merely used because they save people the trouble
               | of inventing phrases for themselves."
        
               | sillysaurusx wrote:
               | Thank you so much! That's the one.
               | 
               | (It's remarkable how often a vague description can yield
               | an HN comment with an answer from a clever sleuth like
               | yourself. Much appreciated.)
        
             | sdwr wrote:
             | I love doing this too, it's fun to write.
        
           | [deleted]
        
       | kevmo314 wrote:
       | There's someone (michaelmior if you're around!) with a false
       | positive 0.46 match to me.
       | 
       | Maybe we could be friends :)
        
       | drc500free wrote:
       | This is a super interesting tool for self reflection. Looking at
       | the top 10 similar accounts to mine, it gives me an arms-length
       | view of how other people probably interpret my tone.
       | 
       | I appear to be a well-educated, over-confident know-it-all.
        
         | pavlov wrote:
         | My #3 match is cstross, and now I'm convinced that my life-long
         | secret dream of being a successful sci-fi novelist is basically
         | a matter of typing. (Ideas? Character development? Ruthless
         | editing? Developing an audience? Having a publisher? What do I
         | need of those when the Computer told me I'm practically a
         | genius...)
        
         | bee_rider wrote:
         | I also enjoyed reading one of my style-partner's posts.
         | 
         | The most noticeable similarity is that we both clearly have
         | strong opinions about some things, and like to share
         | information, but also like to be clear about our unknowns or
         | opinions. So, lots of "sounds likes," "probably," "could be"
         | and so on.
         | 
         | The downside is, I guess, this could be seen as a bit weasel-
         | word-y or indirect.
        
         | seydor wrote:
         | we must be a good match
        
         | bhaney wrote:
         | > I appear to be a well-educated, over-confident know-it-all.
         | 
         | Don't we all?
        
           | sdwr wrote:
           | I hate us insufferable nerds. !
        
         | closeparen wrote:
         | That's what we all come to HN for...
        
       | interroboink wrote:
       | This is one reason why I like legal doctrines such as "beyond a
       | reasonable doubt." Even a 0.9 match in a tool like this could be
       | a coincidence, if there are millions of users. But that won't
       | stop people from casually believing "aha it must be an alt
       | account", based on some anecdata.
       | 
       | It's so easy for something like this to be turned into a tool for
       | a witch hunt, targeting innocents.
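       | 
       | A back-of-the-envelope sketch of the coincidence argument,
       | assuming every comparison is independent and using invented
       | numbers for the false-match rate and the user count (neither
       | is a measurement of this site).

```python
def chance_of_coincidence(false_match_rate, other_users):
    """Probability that at least one unrelated user clears the threshold,
    assuming each comparison is independent."""
    return 1.0 - (1.0 - false_match_rate) ** other_users


# Hypothetical numbers, invented for illustration.
rate = 1e-6        # chance an unrelated account scores above the threshold
users = 1_000_000  # number of other accounts compared against

expected_false_matches = rate * users
p_any = chance_of_coincidence(rate, users)
```

       | With these made-up figures, a one-in-a-million fluke is still
       | expected about once per lookup, and the chance of at least
       | one such fluke is roughly 63%.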
        
       | dsr_ wrote:
       | I like the way some usernames are only 0.9999999 correlated with
       | themselves.
       | 
       | Perhaps 6 or 7 digits is enough?
        
       | rcarr wrote:
       | This is somewhat similar to how they ended up catching the
       | Unabomber. The FBI were literally at a dead end. They ended up
       | posting one of his letters/manifestos in the paper, somebody
       | recognised a turn of phrase the Unabomber used that was unusual
       | and reported it as possibly being their brother, FBI investigated
       | the lead and it led them straight to him.
       | 
       | Excerpts from wiki:
       | 
       | > Before the publication of Industrial Society and Its Future,
       | Kaczynski's brother, David, was encouraged by his wife to follow
       | up on suspicions that Ted was the Unabomber.[91] David was
       | dismissive at first, but he took the likelihood more seriously
       | after reading the manifesto a week after it was published in
       | September 1995. He searched through old family papers and found
       | letters dating to the 1970s that Ted had sent to newspapers to
       | protest the abuses of technology using phrasing similar to that
       | in the manifesto.[92]
       | 
       | > In early 1996, an investigator working with Bisceglie contacted
       | former FBI hostage negotiator and criminal profiler Clinton R.
       | Van Zandt. Bisceglie asked him to compare the manifesto to
       | typewritten copies of handwritten letters David had received from
       | his brother. Van Zandt's initial analysis determined that there
       | was better than a 60 percent chance that the same person had
       | written the manifesto, which had been in public circulation for
       | half a year. Van Zandt's second analytical team determined a
       | higher likelihood. He recommended Bisceglie's client contact the
       | FBI immediately.[96]
       | 
       | > In February 1996, Bisceglie gave a copy of the 1971 essay
       | written by Ted Kaczynski to Molly Flynn at the FBI.[87] She
       | forwarded the essay to the San Francisco-based task force. FBI
       | profiler James R. Fitzgerald[98][99] recognized similarities in
       | the writings using linguistic analysis and determined that the
       | author of the essays and the manifesto was almost certainly the
       | same person. Combined with facts gleaned from the bombings and
       | Kaczynski's life, the analysis provided the basis for an
       | affidavit signed by Terry Turchie, the head of the entire
       | investigation, in support of the application for a search
       | warrant.[87]
       | 
       | https://en.m.wikipedia.org/wiki/Ted_Kaczynski
        
         | googlryas wrote:
         | It was actually his brother.
        
         | fbdab103 wrote:
         | So is the lesson you should have GPT rewrite your manifesto so
         | as to obscure your personal idioms?
        
           | CharlesW wrote:
           | Or something purpose-built like Anonymouth
           | (https://github.com/psal/anonymouth), although it seems to be
           | both unique and dead.
           | 
           | Also interesting:
           | 
           | > _Ross Ulbricht aka Dread Pirate Roberts, the mastermind
           | behind the infamous Silk Road site which served as a black
           | market for drugs, weapons and fake documents was also well
           | aware of the potential danger of stylometry being used
           | against him. At the time of his arrest in a San Francisco
           | public library, the FBI captured images of his laptop screen
           | as evidence. Guess what he had bookmarked -- "Science of
           | Stylometry."_
           | 
           | https://medium.com/svilenk/the-case-for-
           | anonymity-12db114f0c...
        
             | rejectfinite wrote:
             | I mean he used a forum account with an email that had his
             | name in it.
        
               | fbdab103 wrote:
               | That's the problem - it only takes a single slip and it
               | is recorded forever. Perfect opsec is an impossibly high
               | bar if you are maintaining an active online presence.
        
       | elteto wrote:
       | Incredible! There was a very active throwaway account here a
       | while back that I always enjoyed interacting with. I suspected
       | the person had more than one account and this found one that is
       | incredibly close, down to the topics.
        
       | DrStrangeLoop wrote:
       | I tried dang's old account (gruseom) expecting to see his dang
       | account listed. Nothing. Tried dang, sctb (a previous admin) was
       | listed as closest match.
       | 
       | I wouldn't rely on these results
       | 
       | https://stylometry.net/user?username=gruseom
       | 
       | https://stylometry.net/user?username=dang
        
         | pvg wrote:
         | _I wouldn't rely on these results_
         | 
         | You picked a user who posts a massive volume of repeat,
         | template-y comments and found their former colleague who also
         | posted piles of repeat, template-y comments, that being part of
         | both of their jobs.
        
           | DrStrangeLoop wrote:
           | There are a few close matches to dang's style of template-y
           | comments in the results. Afaik none of the listed accounts
           | are Daniel.
           | 
           | I picked dang as he is the figurehead of hn, and didn't want
           | to inadvertently reveal some other user's identity.
        
             | dragonwriter wrote:
             | > There are a few close matches to dang's style of
             | template-y comments in the results.
             | 
             | At least the #1 close match (sctb) was a comoderator with
             | dang, so they were kind of alts as the official voice of
             | HN.
        
       | woodruffw wrote:
       | Neat work!
       | 
       | Out of curiosity: do you filter sentences that begin with '>',
       | indicating a block quote from another user? That might improve
       | the accuracy a little here, if you don't already.
        
         | costco wrote:
         | Yep!
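
A minimal sketch of the kind of quote filtering woodruffw asks about. This is not the site's actual code: the function name is hypothetical, and it assumes comments arrive as plain text with paragraphs separated by blank lines, so a wrapped quote is dropped as a whole paragraph.

```python
import re


def strip_quoted_paragraphs(comment: str) -> str:
    """Drop every paragraph that starts with '>' (a quote of another user)."""
    paragraphs = re.split(r"\n\s*\n", comment)
    kept = [p for p in paragraphs if not p.lstrip().startswith(">")]
    return "\n\n".join(kept)


comment = "> On the other hand, can we agree\nthat this is unethical?\n\nNo."
print(strip_quoted_paragraphs(comment))
```

Filtering by paragraph rather than by line matters because only the first line of a wrapped quote carries the '>' marker.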
        
       | jsnell wrote:
       | After a few tries on boring accounts, I thought to try the
       | account of somebody who was notorious for an incident outside of
       | HN, and had a (deservedly) bad time at HN for a couple of years
       | before the account went dark.
       | 
       | And yeah, there's a bunch of high confidence (.6-.8) hits for
       | that account, and from a quick browse of the comments of the
       | recently active ones, they look really likely to be alts. Like,
       | all three that I looked at had comments that made it very clear
       | it was this person writing pseudonymously. (E.g. writing on their
       | signature issue, and saying they couldn't go into more detail due
       | to fear of self-doxxing; or somebody literally saying that the
       | alt's claims reminded them of the public writings of the
       | notorious guy years ago).
       | 
       | Obviously I'm not naming the account, but this functionality
       | turned out way creepier than I thought the moment I tried it on
       | the account of somebody who has a reason to disassociate from an
       | existing public persona, but still wants to participate here.
        
         | Animats wrote:
         | 0.6 isn't much. I have 3 matches above 0.6, and they're not me.
         | 20 or so over 0.5.
        
           | input_sh wrote:
           | I get one 0.68 match, which... fair enough. It is an account
           | I've abandoned some years ago, no secrets there.
           | 
           | No other hits above 0.5, so I guess that either makes me
           | pretty unique as a commentator or my English is broken in a
           | unique way.
        
           | jsnell wrote:
           | That's why you manually evaluate the matches. And like I
           | wrote in that comment, I did that manual eval, and these
           | clearly are alts of that main account, not spurious.
           | Narrowing down the pool of accounts you'd need to do this
           | kind of manual evals for by a factor of 100000 is a pretty
           | significant change in capabilities.
        
         | tqi wrote:
         | > quick browse of the comments of the recently active ones,
         | they look really likely to be alts.
         | 
         | Hmm isn't a spot check of comments somewhat tautological, since
         | that is how the tool identifies alts (rather than something
         | like IP address or time of day)? If this had been promoted as
         | "find accounts with similar writing style to yours" would
         | people immediately assume alts?
        
           | margalabargala wrote:
           | I would presume that OP is referring to the actual content of
           | the comments. This just does stylometric analysis, which
           | looks at word choice, but not what the arrangement of the
           | words _means_.
           | 
           | If some accounts are found to be stylometrically similar, and
           | then a visual inspection also shows them all stating similar
           | opinions, that latter piece of data is a strong signal.
        
         | thesz wrote:
         | I keep no alternate accounts, but this tool reports best
         | matches for me that appear to be Slavic or just Russian - and I
         | am Russian. Best match score in my list is just above 0.5.
         | There are some clearly alternate accounts on the list, their
         | match scores with this tool are well above 0.7.
         | 
         | It is probable that people of the same cultural origin will
         | have a similar writing style and vocabulary. It is also
         | probable that people of the same cultural origin relate to
         | the world as a whole in the same way: they would like the
         | same things and dislike the same things.
         | 
         | So, in my opinion, it is possible that you have found not only
         | alternate accounts (score above 0.7), but also accounts of
         | people with the same cultural origin (ones that are around
         | 0.6).
        
           | ricardobayes wrote:
           | My highest was 0.41 and the person writes nothing like me. I
           | guess I'm a unique snowflake after all.
        
             | jrumbut wrote:
             | I have a few in the low 0.5's and, honestly, they seem cool
             | and I want to meet them.
        
             | gilleain wrote:
             | my second highest hit (ie, third in the list) is gwern at
             | 0.45 who i'm fairly sure is not me.
        
               | scarmig wrote:
               | I was actually just looking at near hits for gwern and
               | found what's almost definitely a defunct alt for him.
        
               | gilleain wrote:
               | Well, it is certainly NOT me, that's for sure.
               | 
               | On an unrelated topic, I'm starting a service to write
               | comments in the style of others to provide plausible
               | deniability for other alt accounts. Rates negotiable.
        
           | vbezhenar wrote:
           | The tool lists 19 other accounts as similar to mine. None of
           | those are my accounts. The scores range from 0.46 to 0.56.
        
             | bbarnett wrote:
             | You are fools, one and all! This tool's only purpose, is to
             | tag people who use it!
             | 
             | Now they know just who cares about which alternate
             | accounts. They _know_!
             | 
             | They freaking know, man!
             | 
             | You have all fallen for their ploy. Fools!
        
               | thesz wrote:
               | I have no alternate accounts and visited the site out of
               | curiosity, because I used to work in a domain like this.
               | 
               | What I found was worth visiting the site. Somehow notably
               | many accounts with (relatively) high similarity to mine
               | share at least one of my personal traits.
               | 
               | Which is fascinating, to me.
               | 
               | And I think it is worth noting for others - what and how
               | you write can disclose who you are.
        
               | TheOtherHobbes wrote:
               | It knows my IP now.
               | 
               | (Or does it?)
        
               | neodypsis wrote:
               | It offers no privacy policy, so can't tell.
        
             | csa wrote:
             | Fwiw, and as gp mentioned, > 0.7 seems more likely to be
             | alt territory.
        
             | costco wrote:
             | I think people are sort of confused at what this tool is
             | supposed to be which I will concede is partially my fault.
             | The results of this tool are by themselves not indicative
             | of having an alternative account. It generates the 20 most
             | similar users for every single user on the site, regardless
             | of whether they have an alt or not (there's obviously no
             | way for me to know that for every single user). In your
             | case further investigation would reveal that none of those
             | accounts are yours.
        
               | thesz wrote:
               | It is a fun tool, I can assure you. It is just that people
               | have found a use case you hadn't foreseen yourself.
               | 
               | I think your tool must have an internal embedding for
               | each user. Also, most probably your tool uses cosine
               | similarity for the search.
               | 
               | Thus, I would like to suggest a feature: recognize simple
               | arithmetic operations over users' embeddings, such as
               | "thesz - 2 * patio11". It will make things even more fun:
               | this way we can find users who are like me and very much
               | unlike patio11. Even simple additions and subtractions
               | would suffice.
               | 
               | (the idea is taken from properties of word2vec embeddings)
               | 
               | Your tool is thought provoking. What I discovered with it
               | made me think about my use of language and what other
               | languages (body, imagery, etc) I use differently because
               | of who I am. Which made me think about my favorite
               | underrated superhero Cypher [1] - would his innate
               | ability to understand languages make him the best
               | detective ever?
               | 
               | [1] https://en.wikipedia.org/wiki/Cypher_(Marvel_Comics)
               | 
               | Thank you!
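
The embedding arithmetic thesz suggests could be sketched roughly as follows. Everything here is an assumption for illustration: the usernames and three-dimensional vectors are made up, and whether the site stores per-user vectors in this form is not confirmed in the thread.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combine(terms, vectors):
    """Evaluate a weighted sum such as 1*alice - 2*bob over user vectors."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for weight, name in terms:
        for i, x in enumerate(vectors[name]):
            out[i] += weight * x
    return out


def nearest(query, vectors, exclude=(), k=3):
    """Rank users by cosine similarity to the query vector."""
    scored = [(cosine(query, v), name)
              for name, v in vectors.items() if name not in exclude]
    return sorted(scored, reverse=True)[:k]


# Hypothetical users and style vectors.
vectors = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.1, 0.9, 0.2],
    "carol": [0.8, 0.0, 0.4],
    "dave": [0.2, 0.8, 0.1],
}
query = combine([(1.0, "alice"), (-2.0, "bob")], vectors)
print(nearest(query, vectors, exclude={"alice", "bob"}))
```

With these toy vectors, "carol" ranks first because she resembles "alice" and points away from "bob".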
        
         | phreeza wrote:
         | MD5 of the username is 9abc27e93b7e3c04b7c599017c1cfe5f ? The
         | top one seems an odd one out in that case?
        
           | Aachen wrote:
           | Usernames aren't random enough to be safe as a simple MD5.
           | Perhaps with a strong bcrypt, but similar to PIN codes, it
           | might be better to give partial information like "is the
           | second character an ...", assuming nobody else made similar
           | statements. Or give the first ~two hex characters of the
           | hash, so that it would match 1/(16^2) of the usernames. I'm
           | sure there's also a clever way for a zero-knowledge proof
           | here, probably something with diffie-hellman using the name
           | as your random integer or something, but I'm too sick to
           | think about this stuff right now. Privately sharing data
           | publicly is hard.
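
To illustrate the partial-hash idea: two hex characters encode 8 bits, so a two-character prefix splits names into 16^2 = 256 buckets, and each prefix should match roughly 1 in 256 usernames. A small sketch, using a made-up list of candidate names (an assumption, not real HN data):

```python
import hashlib


def md5_prefix(username: str, hex_chars: int = 2) -> str:
    """Return the first few hex characters of the MD5 of a username."""
    return hashlib.md5(username.encode("utf-8")).hexdigest()[:hex_chars]


# Hypothetical candidate pool standing in for a list of usernames.
candidates = [f"user{i}" for i in range(10_000)]
target_prefix = md5_prefix("user42")
matches = [u for u in candidates if md5_prefix(u) == target_prefix]
print(len(matches), "of", len(candidates), "names share the prefix")
```

The published prefix narrows the pool without pinning down a single name, which is the trade-off Aachen describes.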
        
             | lzooz wrote:
             | Good point - I've been running john on that md5 for a
             | couple minutes :)
        
               | wizzwizz4 wrote:
               | Why use John? Just run down the list of Hacker News
               | usernames; it'll take less time. (Or, better still,
               | don't; just because the privacy's theoretically
               | compromised doesn't mean we have to exploit that.)
        
               | lzooz wrote:
               | I don't think there's a public list of all HN usernames
               | is there?
               | 
               | Found this, it includes 250k usernames, but it's not
               | there. https://www.kaggle.com/datasets/hacker-
               | news/hacker-news-corp...
        
               | [deleted]
        
               | meta2023 wrote:
               | The username in question isn't in this dataset but maybe
               | it was created in the past 10 days, as the max(timestamp)
               | is Nov 16th, 2022.
               | 
               | https://console.cloud.google.com/marketplace/details/y-co
               | mbi...
        
             | ahmedalsudani wrote:
             | Another problem is that it's a small set. If you had a list
             | of all HN users, you could compute md5 for all of them in
             | seconds.
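
The point about a small candidate set can be sketched as a plain dictionary attack. The usernames below are invented for illustration, and the target hash is computed from one of them rather than taken from this thread:

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def reverse_md5(target_hex: str, usernames):
    """Return the username whose MD5 equals target_hex, or None."""
    target = target_hex.lower()
    for name in usernames:
        if md5_hex(name) == target:
            return name
    return None


# Hypothetical username list; a real list would be far longer.
known = ["alice", "bob", "carol"]
target = md5_hex("carol")
print(reverse_md5(target, known))
```

Because MD5 is fast and unsalted here, the cost is one hash per candidate, so the hash only hides a name that is absent from the attacker's list.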
        
           | [deleted]
        
         | kcarter80 wrote:
         | Could you elaborate on why it's obvious why you won't name the
         | account?
        
           | notduncansmith wrote:
           | Maybe to avoid attracting any extra attention to this user?
           | Also, as someone who's read HN for a few years, it only took
           | me 2 guesses to find an account that the above comment
           | describes (and not necessarily the same person).
        
             | [deleted]
        
             | sillysaurusx wrote:
             | It was a classy move by jsnell, too. Thank you.
             | 
             | (I don't know who the comment is talking about, which is
             | how it should be. There's no need to blow someone's cover
             | in a highly visible way. Even if they were satan, they'd
             | still be welcome on HN as long as they're writing
             | substantive, interesting comments that follow the
             | guidelines.)
        
               | Normal_gaussian wrote:
               | Such quality comments would track with most thorough
               | Satan representations.
        
           | Aachen wrote:
           | They obviously don't want it to be known, seeing as they've
           | got alts to post under and avoid going into too much detail.
           | Being able to go out and do your own research is different
           | than posting the information open for everyone to see at a
           | glance.
           | 
           | I would say it's obvious why one might respect that wish (do
           | unto others...), but I'm also aware that my and my culture's
           | sense of privacy goes further than many others'.
        
         | tbrownaw wrote:
         | > _but this functionality turned out way creepier than I
         | thought the moment I tried it_
         | 
         | Hopefully this raised awareness means that people who actually
         | need anonymity will be more likely to know to take precautions.
        
           | kaba0 wrote:
           | Genuinely asking, what way is there to combat this? Is there
           | a tool that takes out stylistic elements of your comment?
        
             | thedragonline wrote:
             | I wonder if gpt3 has a use case here?
        
             | marbu wrote:
             | One way would be to run such a tool before posting and
             | then, based on the results, tweak the post and repeat until
             | the similarities are not statistically significant. Or,
             | instead of tweaking, start posting under a new throwaway
             | account. But this won't save you when some new way to
             | analyze style appears in the future. Moreover, there are
             | other types of metadata, such as timestamps, which can be
             | taken into account to narrow down the search space a bit.
             | And obviously, the more you write, the harder it is to
             | control these things.
        
             | paulgb wrote:
             | The site mentions a service called Quillbot which
             | apparently does just that. https://stylometry.net/avoid
        
           | birdyrooster wrote:
        
             | UncleEntity wrote:
             | You know everyone's going to put your username in that tool
             | after a rant like that.
             | 
             | If ever there were a good use for a throwaway account I'm
             | thinking this is it...
        
         | irrational wrote:
         | .6 is high confidence? I did my own username, wondering what it
         | would return, since I know I don't have any alt accounts. The
         | top results are in the .6-.7 range. If they aren't alt
         | accounts, is it just coincidence that we have similar writing
         | styles?
        
           | bee_rider wrote:
           | I think so.
           | 
           | A funny thought -- my "matches" cap out at around .56. Having
           | false positives* in a tool like this might feel like a "bad
           | result" but actually I think it just means that if someone
           | were running this sort of tool across the whole internet, I'd
           | be relatively easy to correlate, while your identity would be
           | intermingled with your .6-.7 partners.
           | 
           | *actually they aren't really even false positives because the
           | tool doesn't promise to detect alts in the first place, just
           | find similar styles.
        
       | joxel wrote:
       | ColinWright is Dang?
       | 
       | Woah
        
       | McDyver wrote:
       | Would this work for Fernando Pessoa and all his heteronyms? :)
        
       | jll29 wrote:
       | The method used, i.e. calculating the cosine of the two authors'
       | word vectors, is poorly suited for stylometric analysis because
       | it is based on a poster's lexicon and the frequency of each
       | word, but ignores stylistically relevant factors like word
       | order.
       | 
       | Also, the cosine of the vectors of word frequencies conflates
       | author-specific vocabulary and topics; in other words, my account
       | is grouped (with >51% similarity, according to the demo) with
       | someone probably because we wrote about similar things. A strong
       | stylometric matcher ought to be robust against topic shifts (our
       | personal writing style is what stays constant when we move from
       | writing about one topic to writing about another topic, just like
       | our personality is what stays constant about our behavior over
       | time - of course styles do change, but the premise then has to be
       | that such changes happen very slowly).
       | 
       | Stylometrics/authorship identification is interesting and has led
       | to some surprising findings, e.g. in forensic linguistics
       | (Malcolm Coulthard wrote several good books about the topic).
       | 
       | This paper lists some other features that could be used and
       | compares a bunch of techniques:
       | https://research.ijcaonline.org/volume86/number12/pxc3893384...
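
The word-order objection can be seen in a small sketch of cosine similarity over word counts. This is a generic bag-of-words illustration with whitespace tokenization (an assumption), not the site's implementation:

```python
import math
from collections import Counter


def word_counts(text: str) -> Counter:
    """Bag of words: lowercase, split on whitespace, count occurrences."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


x = word_counts("the dog bit the man")
y = word_counts("the man bit the dog")
print(x == y, cosine(x, y))
```

Both sentences produce the same count vector, so a model of this kind cannot tell them apart, while two texts on different topics with no shared words score 0 regardless of style.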
        
       | agumonkey wrote:
       | Oh god, that thing starts with direct focus on the search field,
       | opening it showed a bunch of old nicknames, I thought it was the
       | result of some study.
        
       | rand_user_100 wrote:
       | On one hand, thank you for showing us all how easy it is to make
       | something like this. No doubt organizations with more resources
       | already have more sophisticated systems in the same vein.
       | 
       | On the other hand, can we agree that this product is unethical?
       | 
       | In many cases, when a person uses an alt, it is a direct and
       | strong signal that they do not wish their other posts to be
       | associated.
       | 
       | So this product is circumventing the explicit will of the person,
       | and making it available to anyone with zero effort i.e. there is
       | no barrier to getting this info.
       | 
       | I met someone about 10 years ago who said they built this at a
       | university. And their argument also was "actually this enhances
       | privacy because it lets you know something something something".
       | And yet their research grants were coming from one source only.
       | 
       | It _can_ be used for good, but most often it won't.
        
         | A4ET8a8uTh0 wrote:
         | << On the other hand, can we agree that this product is
         | unethical?
         | 
         | It does create a high level of discomfort, because it
         | illustrates well what privacy advocates try talking about to
         | the population at large, but all that said.. how is it any
         | different from regular scraping and analyzing it any other way?
         | 
         | This is a real question.
        
           | rand_user_100 wrote:
           | It's different because you're removing all barriers to access
           | and making it easy and convenient to stalk/dox people.
           | 
           | Imagine you get the urge to track someone, but in order to do
           | that you have to spend a week writing some new software.
           | That's a barrier. And because of it you may change your mind
           | because it's a lot of work with little payoff.
           | 
           | But if that info is just one click away, it's a whole
           | different ballgame.
        
             | [deleted]
        
         | dragonwriter wrote:
         | > On the other hand, can we agree that this product is
         | unethical?
         | 
         | No.
        
       | gus_massa wrote:
       | It would be nice to make the names clickable.
       | 
       | I don't think the list of pg's alternate accounts is accurate. I
       | checked a few. They have many one-liners, which is typical of
       | pg, but the topics and style don't look similar.
       | 
       | I searched a few more and got better results. :)
       | 
       | I searched for myself (I know that I have no alternate accounts).
       | I recognize a few users that are interested in similar topics,
       | and I discuss/upvote them many times. But I didn't recognize
       | most of the users on the list.
        
         | costco wrote:
         | > I searched for myself (I know that I have no alternate
         | accounts). I recognize a few users that are interested in
         | similar topics, and I discuss/upvote them many times. But I
         | didn't recognize most of the users on the list.
         | 
         | It's based purely off frequency of the 200 most common English
         | 1 word phrases, 2 word phrases, 3 word phrases, 1 character
         | sequences, 2 character sequences, and 3 character sequences.
         | Topic does not really have anything to do with it. If I had
         | more time I probably would've done a smarter model that
         | accounted for things like that.
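
A rough sketch of the kind of feature extraction described above: relative frequencies of the most common 1- to 3-word phrases and 1- to 3-character sequences. This is not the site's code. The function names, whitespace tokenization, and building the vocabulary from a toy reference corpus are all assumptions.

```python
from collections import Counter


def word_ngrams(text, n):
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    t = text.lower()
    return [t[i:i + n] for i in range(len(t) - n + 1)]


EXTRACTORS = {"w": word_ngrams, "c": char_ngrams}


def build_vocabulary(corpus, limit=200):
    """Collect the `limit` most common n-grams of each kind and length."""
    vocab = []
    for kind, extractor in EXTRACTORS.items():
        for n in (1, 2, 3):
            counts = Counter()
            for doc in corpus:
                counts.update(extractor(doc, n))
            vocab += [(kind, n, g) for g, _ in counts.most_common(limit)]
    return vocab


def feature_vector(text, vocab):
    """Relative frequency of each vocabulary n-gram in one user's text."""
    cache = {}
    vec = []
    for kind, n, gram in vocab:
        if (kind, n) not in cache:
            grams = EXTRACTORS[kind](text, n)
            cache[(kind, n)] = (Counter(grams), max(len(grams), 1))
        counts, total = cache[(kind, n)]
        vec.append(counts[gram] / total)
    return vec


corpus = ["the cat sat on the mat", "the dog sat"]  # toy reference corpus
vocab = build_vocabulary(corpus, limit=5)
print(feature_vector("the cat", vocab)[:5])
```

Two users' vectors built this way can then be compared with cosine similarity, which matches how commenters in the thread understand the scores.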
        
           | gus_massa wrote:
           | One is also a mathematician. It's trivial that we overuse
           | some technical words even if it's unnecessary.
           | 
           | Another is from Argentina, so I guess the native language
           | leaks, for example using words derived from Latin that are
           | not idiomatic.
           | 
           | And there are a few more, whom it is an honor to be
           | "confused" with, but I have no clue why.
        
       | gavinray wrote:
       | I've complained a lot about Haskell and now it thinks I like
       | Haskell =(
       | 
       | Needs sentiment analysis IMO, otherwise you'll get "Here's a
       | bunch of people who are JUST LIKE YOU", except they use a similar
       | grammar style but hold opposite opinions on the same nouns.
        
         | ahmedalsudani wrote:
         | Serves you right for disparaging The One True Language!
         | 
         | Ok, fine, we'll present Idris with a fig leaf.
        
         | layer8 wrote:
         | It just thinks you engage a lot with Haskell. These are people
         | with whom you have something to talk about. :)
        
       | chronogram wrote:
       | Well done, it found my ancient old account.
        
       | [deleted]
        
       | scotty79 wrote:
       | A fun thing would be to find the most stylistically unique user
       | account.
       | 
       | Which user has the lowest best match?
       | 
       | Mine is 0.58 so I'm really not that unique.
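
The "lowest best match" question could be answered along these lines, given pairwise similarity scores. The user names and scores below are invented for illustration, and the nested-dict layout is an assumption:

```python
def best_match(similarity, user):
    """Highest similarity between `user` and any other user."""
    return max(score for other, score in similarity[user].items()
               if other != user)


def most_unique(similarity):
    """User whose closest neighbour is the least similar."""
    return min(similarity, key=lambda u: best_match(similarity, u))


# Hypothetical pairwise scores.
similarity = {
    "u1": {"u2": 0.58, "u3": 0.31},
    "u2": {"u1": 0.58, "u3": 0.25},
    "u3": {"u1": 0.31, "u2": 0.25},
}
print(most_unique(similarity))
```

Here "u3" comes out as the most unique, since its best match (0.31) is lower than anyone else's.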
        
       | ggerganov wrote:
       | I really liked the informative and straight-to-the-point about
       | page - describing how the algorithm works in a way that is easy
       | to understand. All the important details are summarised there.
       | Well done!
       | 
       | Edit: From the "How to avoid .." page, there is the following
       | sentence:
       | 
       | > Also, most authorship identification algorithms have poor
       | accuracy when working with small amounts of words. This means the
       | optimal strategy would be discarding an account either after
       | every comment or after a small number of comments. Unfortunately,
       | this is against HN rules and may result in a ban.
       | 
       | Can you clarify what this means and why it would result in a ban?
        
         | costco wrote:
         | > Can you clarify what this means and why it would result in a
         | ban?
         | 
         | I have seen dang respond to users multiple times asking them to
         | stop making new accounts especially but not always if it's to
         | avoid rate limiting. I don't know if there's an official policy
         | but it's definitely something I recall.
        
         | krisoft wrote:
         | > Can you clarify what this means
         | 
         | Imagine that for every new comment you want to post you would
         | create a brand new account which you would use precisely once
         | and never again. Then the stylometry would have just a few
         | words and wouldn't have enough corpus to get a reliable
         | signature. If a lot of people do this it would be hard to
         | figure out which account belongs with which human. (Of course
         | if you alone do this, your messages will stick out like a sore
         | thumb. See xkcd 1105.)
         | 
         | > why it would result in a ban?
         | 
         | Because this practice is especially discouraged in the
         | guidelines: "please don't create accounts routinely. HN is a
         | community--users should have an identity that others can relate
         | to."
        
           | stupendous_luck wrote:
           | At the same time, HN doesn't let you delete comments.
           | 
           | Maybe with some GDPR magic.
        
             | krisoft wrote:
             | Not sure what your point is, or how it connects with my
             | comment. Care to elaborate?
        
               | stupendous_luck wrote:
               | Your comment quotes an HN guideline, and my point relates
               | to it. Some users may feel the need to create throwaway
               | accounts in order to post comments that in an alternative
               | reality they could post under their primary account and
               | later delete if desired. It may not stop a scrupulous
               | collector of data, but such a scenario may not be the
               | object of their worry.
               | 
               | Taking this to its logical conclusion, a user may opt
               | to always post under a throwaway account, to avoid any
               | possible tainting associated with a primary account.
        
       | jaredsohn wrote:
       | Amusingly can't run it on the author since not enough comments
        
       | joshstrange wrote:
       | Very interesting, .59 is my lowest, .64 is my highest match, none
       | of these accounts are one of my alts. Though to be fair the
       | handful of times I've used a throwaway I used it for a single
       | comment so I didn't give it much to go off.
        
       | sedatk wrote:
       | I have no alternate accounts, and all my matches are below 0.4
       | for whatever it's worth.
        
       | SevenNation wrote:
       | > ... This site works primarily by analyzing for each user the
       | frequencies of the most common words and phrases in the English
       | language. Accordingly, the easiest way to avoid being identified
       | is to simply use different words than you ordinarily would when
       | writing. More sophisticated models than the one I made can use
       | punctuation, comma usage, and capitalization to identify you so
       | try alternating those as well. Services like Quillbot can help
       | you with this but depending on your circumstances you may not
       | want to send your writings to a third party service.
       | 
       | HN offers many other threads which could be tied together,
       | including:
       | 
       | - time of posting
       | 
       | - ratio of replies to top-level comments
       | 
       | - comments being mainly upvoted or downvoted
       | 
       | - sentiment (mostly angry, dismissive, questioning, etc.)
       | 
       | - most common topics (keyword analysis of post being replied to)
       | 
       | - ratio of new posting to post replies
       | 
       | - first-to-comment on a post
       | 
       | - lone comment on a post
       | 
       | - etc...
       | 
       | It seems very likely that sooner or later every pseudonym for
       | posting content will get discovered and linked. The lesson here
       | is don't post anything that would cause you undue shame or harm
       | if linked directly to your legal name.
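
As one example of the non-textual signals listed above, time of posting can be turned into a comparable profile. A hedged sketch, assuming comment times are available as Unix timestamps and bucketing by UTC hour (both assumptions); the function names are hypothetical:

```python
import math
from datetime import datetime, timezone


def hour_histogram(timestamps):
    """Share of comments posted in each UTC hour (24 buckets)."""
    hist = [0] * 24
    for ts in timestamps:
        hist[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    total = sum(hist) or 1
    return [count / total for count in hist]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy timestamps: account_a posts at 00:00 and 01:00 UTC, account_b at 02:00.
account_a = hour_histogram([0, 3600, 90000])
account_b = hour_histogram([7200])
print(cosine(account_a, account_b))
```

On its own this is a weak signal, since many people share a time zone, but it can narrow the pool when combined with a writing-style score.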
        
       ___________________________________________________________________
       (page generated 2022-11-26 23:00 UTC)