[HN Gopher] Show HN: Using stylometry to find HN users with alte...
___________________________________________________________________
Show HN: Using stylometry to find HN users with alternate accounts

Author here. This site lets you put in a username and get the users
with the most similar writing style to that user. It confirmed several
users who I suspected were alts and, after I informally asked around,
identified abandoned accounts of people I know from many years ago. I
made this site mostly to show how easy this is and how it can erode
online privacy. If some guy with a little bit of Python and $8 to rent
a decent dedicated server for a day can make this, imagine what a
company with millions of dollars and a couple dozen PhD linguists
could do.

Here's Paul Graham: https://stylometry.net/user?username=pg

Here are some frequent HN commenters: (EDIT: Removed due to privacy
concerns)

Author : costco
Score  : 394 points
Date   : 2022-11-26 18:03 UTC (4 hours ago)

(HTM) web link (stylometry.net)

| oblib wrote:
| I've only had one account here. The highest match has a 0.624
| score and the lowest a 0.572. I'm not sure if that means I'm
| unique or common but I'd like to know.
| macintux wrote:
| My nearest match is only at 0.406. It'd be interesting to see who
| the most unique commenters are, but it's also quite possible it
| wouldn't be flattering.
| joisig wrote:
| 0.2506 is my nearest match
| pubby wrote:
| 0.35 is my nearest. In hopes of lowering it even further, here
| are some nonsensical opinions never expressed on HN before: 1)
| Programming peaked with COBOL 2) Paul Graham is responsible for
| 90% of SIDS cases 3) There's no reason to use car when cdr
| exists.
| seydor wrote:
| Well the only solution is to have too many alts so that nobody
| can believe you can possibly have that many
| WalterBright wrote:
| Over in the D language forums, we welcome people who post under a
| pseudonym, and our policy is we won't allow attempts to unmask
| them.
|
| This is to protect high profile users who are secretly enjoying
| programming in D rather than the language they are supposed to
| use.
|
| And, of course, to protect users who feel they might be
| discriminated against if their background was known.
| bo1024 wrote:
| It's very important for those people to be aware of these style
| analysis attacks! Glad this post is raising awareness.
| [deleted]
| [deleted]
| xwolfi wrote:
| Wow... how !
| notduncansmith wrote:
| This has been a great way to find people whose commentary I
| enjoy!
| dsr_ wrote:
| This is interesting.
|
| I'm 0.566 correlated with logfromblammo -- and while we are
| definitely not the same person, I could easily imagine writing a
| sentence such as:
|
| "For some bizarre reason, management has not yet assigned a task
| to their programmer underlings to automated themselves out of
| existence. I can't imagine why."
|
| which is theirs, not mine, from about a year ago. I like that.
|
| On the other hand, I'm nearly as correlated with peterwwillis:
| 0.5485 -- who has no comments and no submissions.
| costco wrote:
| > On the other hand, I'm nearly as correlated with
| peterwwillis: 0.5485 -- who has no comments and no submissions.
|
| This is due to the Firebase API not updating when users ask the
| admins to move their comments to another account.
| lifeisstillgood wrote:
| I had a similar experience finding my most likely alt (.50,
| suggesting I am a unique snowflake as I have always thought
| :-). My most likely alt certainly writes in a style I
| appreciate and on subjects I often mention.
| culi wrote:
| Similar to how they make adversarial fashion[0][1] in order to
| not be tracked by face id AI, I wonder if we can make adversarial
| stylometry tools to run your comments through in order to
| anonymize them
|
| .. [0] https://hackaday.com/2022/10/20/render-yourself-invisible-to...
|
| .. [1] https://adversarialfashion.com/
| carewell wrote:
| OP links to a paraphrasing tool on their website.
| pugworthy wrote:
| Strip leading/trailing white space from the name if it says no
| match.
| robertlagrant wrote:
| Clicking on my top match (0.61) - I can see the similarity. I
| also note they quote the same way, with a > symbol. I wonder if
| that helps!
| lostmyacctoops wrote:
| I'd be _very_ curious to know if these algorithms can link very
| different _types_ of text. I'm not surprised that my style is
| "derivable" on HN, but what if you included my slash-fic pieces,
| my research papers, etc, would it still "catch" me?
|
| Also, talk about a chilling effect. I was already vaguely aware
| of this, and now I'm overthinking every word I'm thinking/typing.
| uberduper wrote:
| I would have expected to be a closer match to myself.
|
| > uberduper: 0.9999999999999991
| birdyrooster wrote:
| sdwr wrote:
| Ooooeeoo oo oooo. Ooooeeooo oooo<<barbra striesand>>
| wizzwizz4 wrote:
| Yes, sadly. In this case, it'd be an arsehole move, but good
| point.
| phnofive wrote:
| If you want to ask HN to remove your data, send a message to
| hn@ycombinator.com.
| CharlesW wrote:
| Not to diminish one bit how you're feeling, but the bright side
| is: Today you know this is easily done (information you didn't
| have yesterday), that the creator had no intention of "outing"
| you specifically, and that you can take steps to obfuscate this
| specific aspect of your posts that connects your public alts.
| dibt wrote:
| Since it looks for similar word usage, false positives seem to
| appear more often when specific topics are talked about, like
| stocks or crypto.
|
| Does this ignore stop words? Or do all words have the same
| weighting? I wonder if only focusing on stop words would give a
| more accurate measure. Maybe we are more comfortable with certain
| stop words than others?
|
| https://en.wikipedia.org/wiki/Stop_words
|
| "Stop words are the words in a stop list (or stoplist or negative
| dictionary) which are filtered out (i.e. stopped) before or after
| processing of natural language data (text) because they are
| insignificant."
| costco wrote:
| All words have the same weighting. I don't ignore stop words,
| in fact most of the ngrams I use are composed almost
| entirely of stop words. Maybe it'd be more effective if I
| ignored them.
| afarviral wrote:
| I'm tempted to use it to find likeminded friends :)
| SnowHill9902 wrote:
| Anything like this for Reddit?
|
| Would translating to another language and back defend against
| this algorithm?
| costco wrote:
| > Anything like this for Reddit?
|
| No but it would be easily adaptable especially given that
| Pushshift is archiving every Reddit comment. Based on some of
| the feedback I'm getting here I don't know if I should open
| source this even though it really wasn't that hard to make.
|
| > Would translating to another language and back defend against
| this algorithm?
|
| Yes. But then you have to send your original comment to a
| translation company so there are privacy concerns there too.
| operator-name wrote:
| I wouldn't worry about that too much as someone's already
| done something similar for reddit
| (https://towardsdatascience.com/using-nlp-to-identify-reddito...),
| and has released their code publicly
| (https://github.com/jabraunlin/reddit-user-id)
|
| Given the technique used, I don't see why something simple
| and local wouldn't defeat it? The "easiest" technique would
| be to use this weighting as a negative metric in rewriting.
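
The stop-word exchange above can be checked on any text with a few lines of Python. This is only a rough sketch, not the site's code: `STOP_WORDS`, `word_ngrams` and `stop_word_share` are hypothetical names, and the stop list is a tiny illustrative one. It measures what share of the most common word n-grams consist purely of stop words, which is the property costco describes.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"the", "a", "of", "to", "in", "is", "it", "and", "that", "i"}


def word_ngrams(text, n):
    """Split text on whitespace and return all word n-grams as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def stop_word_share(text, n=1, top=5):
    """Fraction of the `top` most common n-grams made up only of stop words."""
    counts = Counter(word_ngrams(text, n))
    common = [gram for gram, _ in counts.most_common(top)]
    if not common:
        return 0.0
    only_stop = [g for g in common if all(w in STOP_WORDS for w in g)]
    return len(only_stop) / len(common)


sample = "the cat is in the hat and the dog is in the house"
print(stop_word_share(sample, n=1, top=3))
```

On real comment text the same function could be run with and without the stop words removed, to see how much of the signal they carry.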
| hcs wrote:
| > But then you have to send your original comment to a
| translation company so there are privacy concerns there too.
|
| There are modern offline translation systems available such
| as Bergamot https://browser.mt/
| EMIRELADERO wrote:
| > Based on some of the feedback I'm getting here I don't know
| if I should open source this even though it really wasn't
| that hard to make.
|
| I'd say you should. I'd rather see this publicly and freely
| available to everyone than only to some shady "Big Tech"
| analytics company.
|
| If the "weapons" exist, I would feel more comfortable knowing
| everyone can access them, not just an elite that can use them
| for their own (selfish) purposes.
| A4ET8a8uTh0 wrote:
| I am genuinely torn, because my initial reaction was almost
| the exact opposite, but the comparison to a weapon does
| ring true. And there is indeed an argument to be made for
| a level playing field. At the very least, maybe
| countermeasures can be developed.
| Terretta wrote:
| People don't usually understand privacy risks till their
| own curtains fall down.
| [deleted]
| AtlasBarfed wrote:
| What's a high correlation number?
| ThrowawayTestr wrote:
| Haha, you got me and my main account. That's spooky.
| [deleted]
| jonnycomputer wrote:
| Obviously the next thing to do is make this a popup on someone's
| account name when you hover over it.
| psychphysic wrote:
| Hmmm, doesn't seem to work. But you have convinced me (and many
| others?) to search our alts consecutively and so now you do know
| who has alts?
| elorant wrote:
| Sounds like a nice tool to find friends. You locate people who
| might think like you.
| RepAgent wrote:
| What's up with clusters of users like:
|
| j_s,password4321,carolinew,colinwright,kuharich etc.
|
| https://stylometry.net/user?username=j_s
| https://stylometry.net/user?username=carolinew
| https://stylometry.net/user?username=colinwright
| https://stylometry.net/user?username=password4321
|
| Lowest match for j_s is 0.80 and all but one is black.
| [deleted]
| saurik wrote:
| Why are some users bold?
| srean wrote:
| The non-bold are dead accounts I think
| saurik wrote:
| It isn't due to a mere property of the user, as, for example,
| cushman is not bold as the #2 result for tptacek but is bold
| as the #2 result for icambron.
| srean wrote:
| Good point.
| stavros wrote:
| FYI, the GP said above that bold usernames are those for
| which symmetry holds (ie they're both in each other's top
| ten).
| costco wrote:
| Say you see user2 listed in bold on user1's page. That means
| that user1 is also in user2's top 20 users. In my experience it
| is often an indicator of a good match (but not always). I
| should probably explain that on the site.
| layer8 wrote:
| Instead of making it binary, you could use a gradient
| indicating the strength of the mutual correlation (like how
| HN colors downvoted comments).
| franze wrote:
| totally on spot
|
| my current and my old account
| jonnycomputer wrote:
| Well, one of the closest on my list is my twin, so there's that.
| Retr0id wrote:
| It didn't find my alt, but the second match is one of my twitter
| mutuals - I wonder if we've inadvertently borrowed style quirks
| from each other.
| [deleted]
| hgsgm wrote:
| [deleted]
| samwillis wrote:
| Sticking myself in (I haven't ever had another account) my
| closest match (at 0.43) is the maintainer of an Open Source
| project which I have occasionally commented about. They are also
| British, as am I.
|
| My guess is that as they commonly mention the project and I have
| on a number of occasions, that has formed the link. Plus maybe
| usage of common British terms, but that seems far less
| significant.
|
| It's super interesting!
|
| It would be good if there were more controls to filter the type
| of words and language that are used for the matching algorithm.
| So you could say exclude words not in the dictionary. I wonder
| how that would affect my link with this other person.
| WaitWaitWha wrote:
| I checked a few random user names and I am confused.
|
| - Why is the author costco[0] not in this lookup?
|
| [0]: https://stylometry.net/user?username=costco
| [deleted]
| Aachen wrote:
| - Their first comment and submission were 4 hours ago.
|
| - The text on that page is accurate it seems.
| julienreszka wrote:
| why is my username's score not exactly equal to 1?
| https://stylometry.net/user?username=julienreszka
| costco wrote:
| Python/floating point rounding error. It doesn't mean anything.
| bhaney wrote:
| Well now I'm self conscious about my closest match being an 0.34
| when so many other people are reporting much closer matches with
| accounts that aren't alts. Do I write weirdly?
| spapas82 wrote:
| Same for me, the closest match is 0.36. But I expected that
| because I don't speak English very well so the pool of
| candidates is small.
| CobaltFire wrote:
| My closest is 0.40, so I'm right there with you.
|
| Native English speaker as well.
| klohto wrote:
| 0.36 here! Out of curiosity, are you a native speaker?
| bhaney wrote:
| I am, yes.
| quink wrote:
| 0.39 for myself, I'm a non-native speaker.
| stephc_int13 wrote:
| What is the threshold to be reasonably confident that two
| accounts are from the same individual?
|
| I have only ever had one account here and the closest match is
| at 0.47.
| jefftk wrote:
| Tried my account thinking "I don't have any alts" but it turns
| out I do! In 2018 I changed my username from "cbr" to "jefftk"
| and it pulled that right up:
| https://stylometry.net/user?username=jefftk
| CobaltFire wrote:
| Interesting; I must have a fairly unique style as there are no
| matches over 0.40 for me.
|
| I'm a native English speaker as well, so I'm unsure how to feel
| about that.
| SkyMarshal wrote:
| Oddly, I am not an exact match to myself.
|
| _> Most likely candidates:
|
| skymarshal: 0.9999999999999997_
|
| The other few usernames I tested (pg, dang, some random ones from
| this thread) all matched themselves at 1.0.
| ChrisMarshallNY wrote:
| Interesting, but it gave me 20 accounts, and I _know_ that I only
| have this one.
| notacoward wrote:
| Seems pretty spot-on to me. I tried it with two accounts I was
| already certain were alts - based on other factors like favorite
| topics and common enemies as well as style/tone - and the top
| hits for both were the ones I would have expected.
| nwiswell wrote:
| I don't have an alt but it would be cool to meet my
| stylometry-neighbors. I'm curious whether the writing similarity
| translates to oral communication too
| kiernanmcgowan wrote:
| Love a little NLP project on a public dataset - thanks for
| sharing!
| [deleted]
| jimhi wrote:
| Amazing and I thought my doxxing tool was terrifying -
| https://news.ycombinator.com/item?id=32278871
|
| I am afraid to combine all these methods
| lijogdfljk wrote:
| Yea.. i guess it's time to stop bothering with alt
| accounts/etc. I'll just make one account, maybe differently
| named on different services (makes scraping just a _pinch_
| easier) but aside from that all i can do is modify/remove old
| posts.
|
| Bit of a shame for useful posts/discussions.. but the internet
| is getting really.. fingerprint-laden.
| timeon wrote:
| I had hard time to understand some comments made by my closest
| match. I guess this is good reality check. I need to learn how to
| write more legible posts now.
| FartyMcFarter wrote:
| Sorry, what did you mean? :P
| schappim wrote:
| Interesting that the OP doesn't come up in the search:
| https://stylometry.net/user?username=costco
| Beltalowda wrote:
| Not surprising considering the account had no activity before
| today.
| Aachen wrote:
| Their first comment and submission were 4 hours ago. Text on
| the page is accurate it seems.
| 4qz wrote:
| This is an evil website. We won't have any anonymity soon. The
| highest match is my years old banned account that I forgot about.
| Where did you get the data from?
| JadeNB wrote:
| > This is an evil website. We won't have any anonymity soon.
| The highest match is my years old banned account that I forgot
| about. Where did you get the data from?
|
| I'd way rather have someone tell me "look at all the things I
| can find out about you" so that I can act accordingly (whatever
| that means!) rather than what we've mostly actually got, which
| is companies silently exploiting my data and doing everything
| they can to mumble reassuring but legally ineffective formulas
| assuring me that they deeply respect my privacy.
| costco wrote:
| HN Firebase API. I just wrote a program in C++ with libcurl to
| get https://hacker-news.firebaseio.com/v0/item/1.json,
| https://hacker-news.firebaseio.com/v0/item/2.json,
| https://hacker-news.firebaseio.com/v0/item/3.json, ...
| ufmace wrote:
| I don't know that I'd call this evil. We have no idea who else
| is using this kind of technology but not making the results
| public. Better to know what's possible and take measures to
| make it less effective.
| weinzierl wrote:
| Please don't shoot at the messenger. costco shared this
| voluntarily and I can see no bad intention.
|
| We should see it as an opportunity to learn how easy it is to
| associate different pseudonymous accounts. Nothing drives this
| point home better than a practical demo.
|
| We can be pretty sure stylometry is used widely by bad actors
| already and we should not punish people who help to spread the
| word about these technical possibilities.
| ghaff wrote:
| And this is actually quite a simple approach--which is
| interesting in and of itself.
| While there would be diminishing returns, there are a ton of
| other techniques you could use to make stronger inferences
| about similarity.
| [deleted]
| vfinn wrote:
| Imagine using this across different platforms :/, and let alone
| using different techniques in addition...
|
| edit: maybe you'd catch some criminals if you tried to match
| reddit against dark web for example
| woodruffw wrote:
| HN has an Algolia-based API. It's also _very_ easy to crawl.
|
| I wouldn't call this evil, however: it's merely demonstrating a
| technique that you _should_ be aware of, if you're a
| privacy-conscious person. It looks like they also provide some
| resources for avoiding stylometric detection.
| nanidin wrote:
| I would bet my bottom dollar that the likes of Reddit and
| Google already have models to turn a corpus of text into
| probable demographic data and models to measure the
| similarity of users.
| faeriechangling wrote:
| It's just statistics. I recall that during his whistleblowing,
| Snowden intentionally took anti-stylometry measures.
| wizofaus wrote:
| What match level would you expect to see between two randomly
| chosen individuals?
| seydor wrote:
| does it use the _most_ used words or least used?
| [deleted]
| yyt554 wrote:
| Fun exercise would be to find all accounts that suddenly stopped
| posting around today and correlate them with new accounts
| created around today.
|
| All those scared folks who naively think that it's not too late
| yet. Busted.
| super256 wrote:
| Ahhh, anyone remember this hacking crew who leaked ETERNALBLUE
| and other NSA tools and exploits? Shadowbrokers.
|
| They were always communicating in some kind of meme-russian, and
| their texts were funny to read. [1]
|
| I believe their writing mostly defeated this kind of analysis, at
| the cost of looking like idiots (which was probably the reason no
| one sent them crypto-dollars to buy that stuff exclusively).
|
| Here's an excerpt:
|
| "Attention government sponsors of cyber warfare and those who
| profit from it !!!!
|
| How much you pay for enemies cyber weapons? Not malware you find
| in networks. Both sides, RAT + LP, full state sponsor tool set?
| We find cyber weapons made by creators of stuxnet, duqu, flame.
| Kaspersky calls Equation Group. We follow Equation Group traffic.
| We find Equation Group source range. We hack Equation Group. We
| find many many Equation Group cyber weapons. You see pictures. We
| give you some Equation Group files free, you see. This is good
| proof no? You enjoy!!! You break many things. You find many
| intrusions. You write many words. But not all, we are auction the
| best files."
|
| [1] https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...
| lettergram wrote:
| Did something similar in 2018 (still running locally) which could
| unmask anyone
|
| https://twitter.com/austingwalters/status/104189476543920128...
|
| Made both Metacortex.me and insideropinion.com
|
| The idea being you don't actually need an active directory. It
| would drop in, figure out all the users (provided one account was
| on the AD) and would monitor everyone's skill sets, morale,
| schedule, etc. Worked super well for what it was / is.
| thr0v_awway wrote:
| writing from throwaway:
|
| Holy shit, it works really, really well. It found all of my older
| accounts.
| oliwary wrote:
| Cool! I wonder if it could be run backwards, to identify the
| users on hackernews with the most unique voices.
| Wistar wrote:
| I have only ever had a single account but it returned 19
| possibles with no confidence above .54 but 11 bolded. My own
| account was listed at the top with a confidence of .9999.
| Macha wrote:
| Yeah, I have a bunch of bolded mutuals but none above 0.45. I
| think I have had one or two alts in the past, but probably they
| didn't make the 10000 word threshold for inclusion (nor can I
| remember their names to check if they work in inverse).
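
Self-matches like the .9999 above, or the 0.9999999999999997 reported earlier in the thread, are what costco puts down to floating-point rounding. A minimal sketch of how that can happen, assuming the score is a cosine similarity (the thread only says a scikit-learn function is used, so the metric is a guess) and with made-up frequencies:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical n-gram frequencies for one user (illustrative values only).
freqs = [0.1, 0.2, 0.3, 0.05]

# Mathematically this is exactly 1, but sqrt and division each round to the
# nearest double, so the printed value may differ in the last digit or two.
self_score = cosine_similarity(freqs, freqs)
print(self_score)

# Rounding for display hides the noise, as several commenters suggest.
print(round(self_score, 4))
```

Whether a given user lands on exactly 1.0 depends on their particular vector, which would explain why some accounts match themselves perfectly and others do not.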
| [deleted]
| [deleted]
| karol wrote:
| Are you going to try it on Twitter?
| srean wrote:
| I tried it on a few user-ids that I strongly suspected were owned
| by the same person. My hunches stand corroborated. Not sure who
| is corroborating whom though, me or the script.
|
| Good job.
| msla wrote:
| It puts almost all of my old accounts decently near the top, but
| my original account is almost comically low.
| ALittleLight wrote:
| Of the top ten accounts listed for my name two of them are me.
| zem wrote:
| heh, I looked up the top bold hit for my name and they really do
| sound a bit like me (:
| dibt wrote:
| This doesn't seem to include text from submissions.
|
| I ran it on Brian Armstrong's temp account from here, and it said
| it didn't write 10,000 characters:
|
| https://news.ycombinator.com/item?id=3754664
|
| EDIT: Or maybe it's something else because Brian only wrote less
| than 6k characters. But then why can my account be looked up?
|
| Also, I would guess quoted replies are included, which muddies
| the analysis. Seems to be a very naive implementation. Much more
| can be done, but this was probably just a quick project.
| costco wrote:
| Quoted replies shouldn't be included unless there's a bug on my
| end. Submission text is not included though I probably should
| have included it.
| anpat wrote:
| This needs to exclude "Who is hiring" posts because it confuses
| me with a few of my wonderful former colleagues!
| [deleted]
| antirez wrote:
| writing "antirez" shows accounts with spanish names (none is
| mine). I guess Italian and Spanish speakers write English very
| similarly, but on HN there are a lot more Spanish speakers than
| Italian ones so that's what I get.
| operator-name wrote:
| What does the bold signify? For example when I search for dang
| (https://stylometry.net/user?username=dang) the 4th most likely
| user is not bold whereas the 16th is?
| costco wrote:
| Say you see user2 listed in bold on user1's page.
| That means that user1 is also in user2's top 20 users. In my
| experience it is often an indicator of a good match (but not
| always).
| operator-name wrote:
| Huh, that's a somewhat non-intuitive property.
| silasdavis wrote:
| It is a bit, but if stylometric equality was a thing you'd
| expect it to be symmetric, so if stylometric similarity is
| a thing....
| Trouble_007 wrote:
| Nice work! Thank you, of course I plugged in the obvious HN
| usernames
|
| Edit to add;
|
| Would be nice to have the
| https://news.ycombinator.com/user?id=username links included.
| Trouble_007 wrote:
| And perhaps rounding to 3 or 4 decimal places?
| iHateStylo wrote:
| mysterydip wrote:
| I was curious to use this on myself to see if anyone writes like
| me. Closest was a .51 confidence, so I guess not?
| harryvederci wrote:
| My runner-up has a rating of 0.42378790667730715
|
| C'mon guys, work harder. That's not even close! :-D
|
| Btw, I myself am only at 0.9999999999999999 so I guess I need to
| work harder at being myself.
| iambateman wrote:
| This is cool!
|
| If an account returns a high score for many accounts, does that
| also mean they're relatively less original in style?
| medellin wrote:
| How much writing do you need to analyze results? Would changing
| account every X sentences eliminate this?
| costco wrote:
| Current minimum is 10000 characters. In my own tests accuracy
| was still pretty good at 3000-5000 but I instituted the 10000
| minimum to reduce false positives. Yes it would, if you read
| the advice page on avoiding detection that is one of the things
| I recommend. Unfortunately HN moderators do not really like
| that.
| rmelhem wrote:
| nice one. are you using gpt3 under the hood?
| costco wrote:
| I'm not that smart - my site is basically just doing some
| calculations on word frequencies. You can read
| https://academic.oup.com/dsh/article-abstract/17/3/267/92927...
| and
| https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...
| and https://news.ycombinator.com/item?id=33755898 for more
| information.
| dunham wrote:
| I am curious whether it could pick GPT3 out of the crowd.
| ghaff wrote:
| As you mention on the site, you don't do punctuation. But I'm
| guessing there are some pretty good fingerprints like:
|
| two spaces after a period
|
| Whether someone uses an em-dash/single hyphen/double hyphens
| (which may correspond to house style they're used to)
|
| Whether they use semi-colons
|
| (Presumably harder) but consistent substitutions like loose
| for lose, break for brake, etc.
|
| Use of accents
| sillysaurusx wrote:
| Don't sell yourself short. Simplicity is smart. It's
| astonishing how often the simplest thing turns out to be
| exponentially more effective than the so-called smart thing.
|
| I can't get over how phenomenal this is. Please put every one
| of your side project ideas into production!
| isoprophlex wrote:
| Simplicity is the greatest form of sophistication! Great
| work!
|
| One small nit from a user experience point of view..: it'd be
| easier on the eyes if you just truncated those cosine
| similarity scores (or whatever score you're using) after the,
| say, 5th digit. Showing the entire float is kinda messy to my
| eyes.
| Dma54rhs wrote:
| It's easy to write complicated systems, it takes a genius to
| make it simple.
| rmelhem wrote:
| cool and thanks for the clarification. i ask that mainly
| because of the request limit of openai, which is something
| that makes many scalable ideas unfeasible
| godisdad wrote:
| Can we find Satoshi with this?
| thisisnotapipe wrote:
| Cardano founder Charles Hoskinson believes that only one person
| fits the profile of the mysterious Bitcoin creator, Satoshi
| Nakamoto.
|
| In a surprise Ask Me Anything (AMA) session on YouTube,
| Hoskinson reveals that he has narrowed down his search to one
| person who he believes is the only individual that fits the
| part.
|
| "I've been very vocal lately on this.
| I think that first, it
| doesn't matter but second, it's probably Adam Back. If you look
| at the preponderance of the evidence, Occam's razor applies and
| the most likely answer usually is, and there's no mystique or
| magic there but he just fits the profile. You're looking for
| somebody who's in their 40s to 50s who created Bitcoin in 2008.
| That would fit Adam. English education, grammar, all that
| stuff, the right computer science background, exactly the right
| credentials you'd look for. You probably can get pretty far
| with code stylometry towards validating that."
|
| (from https://tokenpost.com/Charles-Hoskinson-Believes-One-Person-...)
| drpancake wrote:
| A few people have tried that e.g.
| https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184...
| crecker wrote:
| See also:
| https://serhack.me/articles/unveiling-anonymous-author-stylo...
| costco wrote:
| That post was actually what motivated me to make this. I'm on
| your email list :)
| crecker wrote:
| WOW! It's such a pleasure for me
| neodypsis wrote:
| How do you protect yourself from impersonators?
| nvr219 wrote:
| I only got 0.9999999999999992 for myself :(
| noncoml wrote:
| Naturally Born Imposter
| bumble_bee900 wrote:
| It's accurate enough that I had to create a new account now :)
|
| I guess it's difficult to evade it as the word frequency
| certainly catches all about the countries I frequently refer to,
| programming languages, interests etc.
| [deleted]
| [deleted]
| p4bl0 wrote:
| It's funny that I only match at 0.9999999999999982 with myself
| while all other usernames I tried matched with themselves at 1.0
| ^^.
| srean wrote:
| https://theuijunkie.com/myth-or-fact-did-charlie-chaplin-los...
| [deleted]
| sillysaurusx wrote:
| Wow. This gives a lot of false positives, but it found all ~10 of
| my old accounts over the years.
|
| The most interesting thing is that my writing style changed
| pretty drastically since a decade ago.
| Searching for my oldest
| account matches my earliest usernames, whereas searching this
| account matched the rest.
|
| The details of the algorithm are fascinating:
| https://stylometry.net/about Mostly because of how simple it is.
| I assumed it would measure word embeddings against a trained ML
| model, but nothing so fancy.
| [deleted]
| FormerBandmate wrote:
| > sillysaurus3
|
| > sillysaurus2
|
| Tbf a human could have found a bunch of them relatively easily
| lettergram wrote:
| Frankly similar to how I was doing it back in 2018 (when you
| and I chatted about it on HN lol)
|
| https://news.ycombinator.com/item?id=17944293
|
| The approach I took was a bit different, but also no ML
| required.
|
| The real trick is pruning and going cross platform. There are
| around 100k active HN accounts (meaning posts a few times a
| year), maybe 200k if you count at least one post a year. But
| <10k that post weekly.
|
| It's a very small space to try to compare so simple methods
| will work fine.
| costco wrote:
| Exactly. HN emphasizes long-form posts much more than other
| forums which makes the commenters here very susceptible to
| this kind of analysis. Plus you can fit every single HN
| comment in RAM on a mid tier gaming laptop so it's even
| easier. I was trying to think of applications of this kind of
| data and the only thing I could think of was moderation
| tools/detecting ban evaders but what you've done seems much
| more profitable lol.
| echelon wrote:
| It works like a charm for me too.
|
| I put in my username and found my pre-echelon alt,
| possibilistic.
|
| (Echelon was taken when I registered possibilistic, but it must
| have been unused and dropped.)
| costco wrote:
| Yeah top 20 is a little excessive because in my own tests I
| found that top 20 is only marginally more accurate than top 10.
| You can get a more academic explanation
| [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...).
| I was amazed
| too because it seemed too easy!
| sillysaurusx wrote:
| FWIW, top 20 was necessary for mine. The bolding was a
| brilliant move. Several of my accounts were ranked 10-20, but
| popped out due to the bolding.
| justusthane wrote:
| What does the bolding indicate?
| sillysaurusx wrote:
| The explanation is here:
| https://news.ycombinator.com/item?id=33755466
|
| As far as I'm concerned, it's the killer feature of the
| app. The top 20 results may be noisy, but the bolded
| results have a signal to noise ratio close to infinity.
| costco wrote:
| The funny thing is that I thought of it while eating
| dinner last night :)
| jsnell wrote:
| The precision of the bolded results looks like maybe 30%
| to me. Significantly better than the non-bolded, but
| nowhere near perfect precision.
| costco wrote:
| False positives become an increasingly difficult problem
| the more and more potential authors you introduce. If I
| had written a fancier model it probably wouldn't be as much
| of a problem but what can you do.
| jsnell wrote:
| Yes, this wasn't a criticism of the tool. It is crazy
| good.
|
| But I don't think people should be making the assumption
| that bolded results are definite alts, which sillysaurus'
| comment reads like.
| sillysaurusx wrote:
| Hmm, that wasn't my intent. I see this tool as a
| recommendation engine more than a doxxer. By "signal to
| noise ratio close to infinity," I meant that if you visit
| one of the bolded accounts, they'll probably sound a lot
| like you.
|
| It's one of those ideas that makes the tool substantially
| more effective, yet never would've occurred to me. It's
| like the simplicity of pg's "a plan for spam" algorithm:
| deceptively simple, but (like scrubbing dishes with
| fingers) works really well.
| dragonwriter wrote:
| Of my top 20, 19 are bold, all are above 0.6, and I have
| no alts.
| loeg wrote:
| I have 7 bolded names (0.53-0.62) in the top 20 list, and
| none are alts of mine.
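
The bolding rule costco describes (user2 is bold on user1's page when user1 is also in user2's top 20) is a mutual top-k check. A small sketch under stated assumptions: `scores` is a hypothetical dict of pairwise similarities with made-up values, and `top_k` and `bolded` are illustrative names rather than anything from the site.

```python
def top_k(scores, user, k):
    """Return the k users most similar to `user`, best first."""
    ranked = sorted(
        (other for other in scores[user] if other != user),
        key=lambda other: scores[user][other],
        reverse=True,
    )
    return ranked[:k]


def bolded(scores, user, k=20):
    """Users in `user`'s top k who also have `user` in their own top k."""
    return [other for other in top_k(scores, user, k)
            if user in top_k(scores, other, k)]


# Made-up symmetric similarities between three users.
pairs = {("a", "b"): 0.6, ("b", "c"): 0.9, ("a", "c"): 0.3}
scores = {}
for (u, v), s in pairs.items():
    scores.setdefault(u, {})[v] = s
    scores.setdefault(v, {})[u] = s

# With k=1, b is a's nearest neighbour, but b's own nearest is c,
# so b is not bold on a's page while b and c are bold for each other.
print(top_k(scores, "a", 1), bolded(scores, "a", 1))
print(top_k(scores, "b", 1), bolded(scores, "b", 1))
```

This also shows why the lists are not symmetric even though the similarity score is: membership in a top-k list depends on who else is competing for the slots, which fits the reports of bolded names that are not alts.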
| morsch wrote: | I'm one of them and I can confirm. But then again that's | what I'd say if I was. | loeg wrote: | Hi style-adjacent friend :-). Just briefly looking at | your recent comment history, we seem to find different | kinds of articles interesting, but maybe have a similar | writing style. | ghaff wrote: | Pretty much the exact same. (I do have a throwaway | account but I rarely use it and it probably hasn't been | used enough to qualify.) | User23 wrote: | I'd figured it would be some kind of n-gram frequency analysis. | Would be interesting to code that up and compare. | costco wrote: | It is. The description on the about page is a little | simplified, but basically I look at the most common word and | character ngrams of size 1,2,3 (200 each), put all the | frequencies in an array and then compare to all the other | users with https://scikit- | learn.org/stable/modules/generated/sklearn.me.... | User23 wrote: | Cool, I only skimmed the description; maybe I needed to read | it more carefully. | | Have you considered doing rune rather than word ngrams? I | can imagine that might be prohibitively expensive, but I | really don't know. I did something like that long long ago | in C for automatic document language detection. It was | quite accurate. | throwaway5434q wrote: | Wow. This is insane, it found my old accounts. So throwaway | obviously (because I'm a bit of an asshole) but this really is | amazing. It also highlighted another account that's not me, but | looking through their comments i don't see any resemblance to me | either. | stavros wrote: | Oh wow, it's really sure that I'm stavrosk, which I am: | | https://stylometry.net/user?username=stavros | | The next person is 30% less certain, that's huge! This would | basically identify any alt I might have with near certainty. | jvolkman wrote: | stavrosk doesn't have any posts/comments? What's it using to | match? | stavros wrote: | It's my old username. | costco wrote: | Huh... 
seems there are some inconsistencies between what's | presented on news.ycombinator.com and the Firebase API. | Glad it matches for you though :) | stavros wrote: | I guess they just didn't go back and reparse, not a big | problem. I don't think people change their username | frequently :P | rogual wrote: | Funny thing is, it thinks I'm you, but it doesn't think you're | me! | | https://stylometry.net/user?username=rogual | | I'd have thought this stylometry thing would be commutative. | ed25519FUUU wrote: | Very cool! And really a shame that you're not allowed to delete | an old alt account or comments on HN! It follows you forever | apparently. | Arathorn wrote: | It found my old account (ara4n; i lost the password) at 0.63. | More amusingly it found my cofounder too, who hardly ever posts | here (at 0.48) | thot_experiment wrote: | Maybe this is a good tool to find new friends. :P | pkos98 wrote: | Sorry dang, aka sctb: https://stylometry.net/user?username=dang | Macha wrote: | In this particular case, it seems to be picking up the stock | moderation responses as it looks like sctb was a moderator | account until 2019. | alpacabag wrote: | Semaphor wrote: | My alt accounts (not really, all below 0.5) seem to also be | European or German Firefox users. Good for us ;) | nr2x wrote: | Honeypot to see what accounts are tested in sequence? | | ;-) | costco wrote: | I turned off nginx logging if that makes you feel any better. | Of course there's no way for you to verify that because I'm | just a random guy on the internet but I will tell you that I am | a civic minded citizen who is concerned about privacy and the | Internet. | nr2x wrote: | Only half kidding, but I'd I were state Intel it's what I'd | be doing. :D | atum47 wrote: | at what threshold is it considering alt account? | costco wrote: | There is no threshold. This site does not make any call as to | whether a user is an alt or not. 
It just gives the users with | the most similar word choice and from there it is up to you to | decide (is there a very specific detail that both accounts | mention, do they post at similar times, etc). I will say bolded | accounts are substantially more likely to be alts though. But | obviously it is not guaranteed that every user has an alt. | F_r_k wrote: | Found my phone account; I'm quite impressed, really ! | ufmace wrote: | I wonder what's a reasonable threshold for "probably the same | person". I've never had an alt on HN, and when I searched myself, | it found 3 other users above 0.6, none of whom I've ever heard of | before. | [deleted] | costco wrote: | If it's >0.9 you can almost guarantee it's an alt but I've | seen certain matches at 0.6. The problem is writing styles | change over time. Another idea I had was converting the scores, | which are just cosine similarity scores, into percentiles (so | 0.99 would be 99th percentile of certainty) to make them more | human interpretable. | forgotpwd16 wrote: | >The problem is writing styles change over time. | | Will be interesting if we could plot the writing style | divergence over time. | throwdbaaway wrote: | I got matched with my old account with a score of only 0.45 | bonzini wrote: | The people at 0.4-0.6 with me do share some interests. That's | cool on its own. | throwup wrote: | I make new accounts every so often and the accounts of mine | that it found have a score of around 0.3. I'm not actively | trying to defeat stylometry but it's possible I just have a | particularly unremarkable writing style. | xwolfi wrote: | Well I must be stereotypical myself because it found me at | 0.8 ! | MBCook wrote: | I have no alts. The highest match for me is about 0.66. | dotancohen wrote: | Interesting. The highest non-me account is under 0.4 on my | page. 
I do not believe that I have such a unique writing style | - especially since half my posting is on mobile and therefore | possibly slightly different than my desktop posts. | dwringer wrote: | My closest is 0.4879. I know I tend to be wordy but I thought | I had a pretty generic style as well. This is definitely a | fascinating demonstration. | drdec wrote: | Feeling better about my high of 0.49 now | pyb wrote: | 0.6 is not high enough to indicate an alt | Yeahsureok wrote: | On the how to avoid section: Isn't running comments through a | randomised translator a few times then back considered a | countermeasure also? | | Also think it's probably poor form to list users as examples | without their permission. | costco wrote: | > On the how to avoid section: Isn't running comments through a | randomised translator a few times then back considered a | countermeasure also? | | Yes. | | > This may be out of line but isn't pg on here with a different | username, Levenschtein distance of one that's not included? Or | is that just a very motivated 13yo account who writes a lot of | admin-esque comments. | | What other pg account are you referring to? I want to see it so | I can see what my algorithm missed. | | > Also think it's probably poor form to list users as examples | without their permission. | | You're right. I'll remove that - I just wanted some examples | especially for people on phones who don't feel like typing. | Thanks for the feedback. | jacooper wrote: | > However, using automated methods like machine translation | services do not appear to be a viable method of circumvention. | | https://www.whonix.org/wiki/Stylometry | Lichtso wrote: | I wonder how much this can be improved if metadata is taken into | account as well. Especially the distribution of common post dates | and times modulo a week, which also exposes in which timezone | somebody probably lives. 
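Lichtso's idea of comparing posting times modulo a week could be sketched roughly as below. This is only an illustration, not something the site is known to do: the helper names `hour_of_week_histogram` and `cosine` are invented for the example, and it assumes each account's comment times are already available as Unix timestamps in UTC.

```python
from datetime import datetime, timezone
import math

HOURS_PER_WEEK = 7 * 24  # one bin per (weekday, hour) pair


def hour_of_week_histogram(timestamps):
    """Return a normalized 168-bin histogram of posting times (UTC).

    `timestamps` is an iterable of Unix timestamps (hypothetical input).
    """
    bins = [0.0] * HOURS_PER_WEEK
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        bins[dt.weekday() * 24 + dt.hour] += 1.0
    total = sum(bins)
    return [b / total for b in bins] if total else bins


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two accounts that post in the same weekly slots score close to 1, and accounts with no overlapping slots score 0. Such a score could be combined with a text-based one, though how to weight the two is left open here.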
| 2OEH8eoCRo0 wrote: | This found an old account that I forgot I even had but with a lot | of false positives. Neat! | bscphil wrote: | The scary thing is that once you have this data, finding HN | matches for individual targeted users on other sites becomes | trivial, even if those sites are harder to scrape. I bet most | people here have an anonymous Reddit account, for example. If you | wanted to know who was behind a particular Reddit account, you | could feed it into something like this and compare the results | with HN, where accounts are less likely to be anonymous. Or build | a database based on blogs, Github comments, etc. | | Also, since this uses only word frequency, there are probably | relatively easy improvements to make that would make it even more | powerful, like looking at particular runs of words that are | unique. Some expressions or figurative language only show up in | combinations of words, and tend to be highly style specific. | faeriechangling wrote: | Thus proving the only actually anonymous community in practice | is 4chan, and that's why it's so toxic. | sbierwagen wrote: | If you define "toxic" as "people disagreeing with you", sure. | That was what the entire internet was like until maybe 2005. | philosopher1234 wrote: | "People disagreeing with you" describes almost none of the | conversation on 4chan | ben_w wrote: | I'm old enough to remember when 4chan was _self | identifying_ as the Internet 's hate machine, before xkcd | referenced it as such: https://xkcd.com/591/ | | Sometimes people insist that's all role-play and irony; | others insist that if it ever was, it certainly isn't now. | | But regardless, I remember pre-2005, and it wasn't all like | what I saw the two times I looked at 4chan. Bits were. Bits | were _much worse_. But mostly, _mostly_ , people were | kinder... at least, unless political tribalism came up. 
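For reference, costco describes the method upthread as the most common word and character n-grams of size 1, 2 and 3, turned into frequency arrays and compared by cosine similarity, so short "runs of words" are already partly covered by the word 2- and 3-grams. A minimal standard-library sketch of that general idea follows. It is not the site's code: the function names, the tokenizer regex, the per-group normalization and the reading of "200 each" as "top 200 per n-gram kind and size" are all assumptions made for illustration.

```python
from collections import Counter
import math
import re


def ngrams(tokens, n):
    """Return consecutive n-grams of a list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def features(text):
    """Count word ("w") and character ("c") n-grams for n = 1, 2, 3."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)  # crude tokenizer (assumption)
    chars = list(lowered)
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(("w", g) for g in ngrams(words, n))
        counts.update(("c", g) for g in ngrams(chars, n))
    return counts


def build_vocab(texts, top_k=200):
    """Pick the top_k most common n-grams per (kind, size) group over all texts."""
    per_group = {}
    for text in texts:
        for (kind, gram), c in features(text).items():
            per_group.setdefault((kind, len(gram)), Counter())[(kind, gram)] += c
    vocab = []
    for group in sorted(per_group):
        vocab.extend(f for f, _ in per_group[group].most_common(top_k))
    return vocab


def profile(text, vocab):
    """Relative frequency of each vocabulary n-gram within its own group."""
    counts = features(text)
    totals = Counter()
    for (kind, gram), c in counts.items():
        totals[(kind, len(gram))] += c
    return [counts[f] / totals[(f[0], len(f[1]))] if counts[f] else 0.0
            for f in vocab]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def most_similar(target, texts, top_k=200, limit=20):
    """Rank every other user by cosine similarity to `target`."""
    vocab = build_vocab(texts.values(), top_k)
    vecs = {user: profile(text, vocab) for user, text in texts.items()}
    scores = [(cosine(vecs[target], v), user)
              for user, v in vecs.items() if user != target]
    return sorted(scores, reverse=True)[:limit]
```

Here `texts` is a hypothetical dict mapping a username to all of that user's comments joined into one string; `most_similar("some_user", texts)` returns up to 20 `(score, username)` pairs, highest score first.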
| costco wrote: | I could have used a part of speech tagger, looked at time of | day a user posts, capitalization, spelling errors, etc. From | what I understand the state of the art is lightyears ahead of | this, there are even companies with actual linguists who will | act as expert witnesses in court to say stuff like "we can say | with 95% certainty that xyz authored this email." Honestly it's | kind of scary. There are papers that talk about cross platform | authorship attribution, one I think did it with Twitter, | Blogspot, G+ and had pretty good results. | saurik wrote: | It would be convenient if the usernames linked to the comment | pages on Hacker News (to avoid having to copy/paste and URL hack, | which is made even slightly more annoying because for some reason | when I tap and hold the usernames to copy them your markup--I | haven't looked at why yet--is causing an extra space character to | get copied on the left). | honkler wrote: | Not today. | | You fail, I win. | costco wrote: | Nice. Just out of curiosity are you taking any countermeasures | or varying your writing style across accounts in any way? | psychphysic wrote: | My second closest match was 0.35 but searching people where | they have matches 0.5-0.75 I suspect that's mostly to do with | number of posts leading to better statistics. | soneca wrote: | I have two accounts. This one, "soneca", that is my first one and | most active by far, and another one that I use sometimes mostly | for Show HN and few comments. | | When I searched the other one, "soneca" was the first guess, with | 0.4. | | But when I searched "soneca", the other one was not in the top | 20. | 00F_ wrote: | ive had maybe a hundred throwaway accounts on HN over the past | ten years. generally, i make an account, say something that is | apparently wildly offensive to someone else, get flagged and | down-voted and then muted or hell-banned. 
then i make another | account because i never did anything wrong and start the process | over again. ive emailed the admins, tried to reason with the | admins, it never does any good. the power is held by power-users | who flag people -- most of the power of an admin at the end of | the day but without any of the accountability. as long as they | are following the mainstream dogma, its all good. | | anyway, this app was able to identify a lot of my accounts. but a | lot of the matches werent me. bold matches were almost all me. | but i know there are many more matches than those that were | listed. it mainly showed my most recent accounts. | | i think most people would get a sick feeling in their stomach if | they tried this app. i dont think people are prepared for a world | where you can type someones name into an app like this and | produce everything ever recorded online that was created by that | person. not only this but everything highlighted and summarized | to answer any question about that person. this is what advanced | ai will bring us. an information implosion where the planet-sized | ocean of data that is just floating all around us suddenly and | violently coalesces into the objects of our new societal | calculus. violent is a good word. and this is just the change | that one can see coming with ai. | costco wrote: | You are definitely right. Part of the reason I chose the 10,000 | character minimum was so that people using throwaways in the | true sense would be entirely excluded. I don't plan on keeping | this up forever and I too would not feel comfortable if this | was deployed at scale. | ayewo wrote: | Would you be open to open sourcing the code when you decide | to shutdown the service? | stupendous_luck wrote: | You really don't need advanced AI to do it. Just a bunch of | scrapers and some run of the mill statistics. And guess what, | it's been done by many companies already. They just don't care | to create such a site. 
| moneywoes wrote: | What algorithm is being used? | interroboink wrote: | It's described here: https://stylometry.net/about | rglover wrote: | It's moments like this I'm proud to have my insanity on full | display without obscurity. Was surprised to see a bunch of ~30% | matches despite not having any alts. | kfichter wrote: | Does anyone here have a reasonably wide variety of similarity | ratings? I'd love to see the difference between a 0.2 and a 0.8 | for the same account. | peacelilly wrote: | This is creepy. | noncoml wrote: | I think the word you are looking for is uncanny | jallasprit wrote: | Most likely candidates: pg: 1.0 | montrose: 0.604073065373204 mattmaroon: | 0.5900372458160795 natsu: 0.5519832271289953 | rauljara: 0.5418566694533273 waterlesscloud: | 0.5378996309342633 damoncali: 0.5292014150349463 | gruseom: 0.5290151637991445 kemiller2002: | 0.5254174524920762 jfengel: 0.5231938496089998 | jamesaguilar: 0.5229081613163672 houseabsolute: | 0.5219738531025365 danssig: 0.5195368367601849 | austenallred: 0.519343009683366 loewenskind: | 0.5177030083877397 baguasquirrel: 0.5153841099708854 | asdfasgasdgasdg: 0.5146704002447524 aptwebapps: | 0.5144149629369845 allenbrunson: 0.512802806408646 | danielweber: 0.5123620795710832 | [deleted] | andsoitis wrote: | we leave fingerprints everywhere | throwawayhghcj wrote: | I'd like to request the author takes this offline please until | the implications can be thought through. | | This is breaking anonymity that people incorrectly thought would | not be revealed. | | For some it might be awkward, others it might be quite | problematic. | s3000 wrote: | This is nothing new, e.g: | | Analyzing stylistic similarity amongst authors | | https://news.ycombinator.com/item?id=10050603 | | http://markallenthornton.com/blog/stylistic-similarity/ | | 37 points by lingben on Aug 12, 2015 | kaba0 wrote: | I would agree with you but the genie is out of the bottle | already. 
Nigh everyone can and could have reproduced these | results, especially that archive.org and similar things exist. | | So, I don't think it causes any new harm, if anything it gives | you future risk aversion. | silasdavis wrote: | The top hit for me, though not a very high correlation (0.3 ish), | is to my surprise someone I have met. I don't appear on their top | 20 though. | musicale wrote: | > I made this site mostly to show how easy this is and how it can | erode online privacy | | looks like it can indeed | | > Here are some frequent HN commenters: (EDIT: Removed due to | privacy concerns) | | How surprising that someone might object to being included in a | demonstration of the erosion of privacy! | | Is the site opt-in or opt-out? | Aachen wrote: | I doubt they asked 78k users for permission when there's no | standardized way of reaching out if you're not a site admin. | It's opt out if anything. | bee_rider wrote: | You opt into making your writing publicly available when | making posts on this site. I'm not sure what Ycombinator's | user agreement* says about this, but it is pretty obvious | that they haven't done anything to prevent it (and it isn't | clear what they could do). | | * and I mean they author of the tool is here making posts, so | I guess they have agreed to the TOS, but clearly someone who | hasn't agreed to it could also make this tool and scrape out | publicly available posts without agreeing to anything. | StrangeDoctor wrote: | Have you done any data analysis on distributions of similarity? | How similar you'd expect any 2 people to be given English focused | around tech? Or any other interesting stats you'd like to share? | | Very nice clean site, great work. | Ros2 wrote: | I interviewed years ago with someone who let me know that they | use a pseudonym as an employee and their chosen name even got | posted as the author for articles they wrote for the company. | They were very concerned about their privacy. 
| | I know their blog, which is their HN username, and this tool | found their other account. | | Perhaps ironically, this person stood out a lot because of this | and I didn't forget them. | zxcvbn4038 wrote: | How long until this becomes the algorithm for a dating site? | | "Find hot single women who write just like you" | nrp wrote: | This seems like a great way to hire freelance copywriters/ghost | writers too. I would absolutely hire someone I knew could match | my tone well for writing generic unattributed copy. | forgotpwd16 wrote: | Wouldn't be surprised if dating sites already used similar | algorithms. | bornfreddy wrote: | Wouldn't be surprised if most of the women on a specific | dating site had very high similarity scores. | davebillyhock wrote: | This found an alt that I created specifically to see if I could | write artificially to defeat this kind of analysis. I have seen | other tools like it posted to HN, but none before had found that | account. I guess I need to up my game. | [deleted] | CharlesW wrote: | If you don't mind sharing, are you "writing artificially" | purely in your head, or are you using techniques like | intermediate translations? | davebillyhock wrote: | No mechanical means, but I have referred to a thesaurus | occasionally. Mostly I tried to change my sentence structure, | not just words. It requires actually thinking differently, in | a way. Which makes it difficult to know how well I'm | communicating. | crtified wrote: | I imagine this would be quite difficult in practise, due to | all the subliminal factors behind a person's writing | choices. | | For example, as somewhat illustrated here, your personal | vocabulary is a kind of fingerprint. As you mention, using | a thesaurus can somewhat alleviate that, but if a thesaurus | is only changing a small % of your words, then it will only | have a suitably small % effect upon analysis. | | To go yet further might (I suspect!) 
entail methods such as | directly lifting and using other people's sentences to | convey your own thoughts. But even then, "your own thought | patterns" are still informing the manner of the post, to | some extent, so over time increasingly robust analysis may | still find patterns to hook into. | neodypsis wrote: | I wonder if someone will come up with a Grammarly-like | tool which you can feed with sample writings to help you | increase/lower the similarity score of a new text you are | writing. | ruined wrote: | didn't find a single one of my alts. nice | costco wrote: | I obviously don't expect you to help me but do they have at | least >10000 characters written and are you varying your | writing style in any way? | paulpauper wrote: | Inserting random Unicode blank, 1/4, 1/2, or zero space | characters into your writing may help thwart it too, if you are | paranoid | UncleEntity wrote: | Huh, that's how I signal my KGB handler... | lifeisstillgood wrote: | How much should we fear de-anonymisation ? | | A lot of discussion on the thread are over "how can we prevent | this". I would like to know why should we not embrace this and | similar technologies? | | The benefits in my view are large - online behaviour tracks back | to real life - and epidemiology speaking the value of millions of | test subjects across every question are invaluable - from | traditional medicine to "mass psychology recommendations" | | I can guess some downsides (hiding from abusive exes) but am | interested in studies, surveys, reports etc - any HN thoughts | welcome | headhasthoughts wrote: | What could possibly be the harm in allowing people to harass | others based on posts they made decades ago? What could | possibly be harmful in making a person who for whatever reason | has changed their online identity easier to track? What could | be remotely harmful about allowing Marlboro to find the | accounts of ex-smokers? What could be the harm in tracking | underaged users site by site? 
| | I'm sure this is completely harmless and will not harm society. | rejectfinite wrote: | >online behaviour tracks back to real life | | This is good to you? | | Okay, let's just make it like China or SK where your login is | your citizen ID and if you write bad things the bad word police | will take you away. | | Also, no, I have no alts. | lifeisstillgood wrote: | So I am asking because my views are only challenged inside my | own head, hence the need for external thoughts. | | But firstly the "governments will come and do bad things" | argument - yes this is clearly and obviously a major problem | - but not one solvable by technology in any way. Fixing | violent dictatorships is an IRL problem - one that requires | enormous effort and sacrifices (see Ukraine for obvious | example). We cannot pretend that a browser extension or a | ground up rewrite of Twitter will defeat Putin or would have | stopped Hitler. | | As for "free" countries (something like 120+ have open free | elections), we still have online abuse for voicing opinions | that some people don't like (anything from pro/anti Trump to | LGBT and bitcoin etc). Those are real consequences but rarely | government inspired and honestly I suspect we need better | support for police in prosecuting such things - I mean a | death threat is a death threat. | | In general my view seems to be we should have the same | protections online as we do offline - and if those | protections are "in theory only" that requires us to use our | voting and other political power to change it - not to | obfuscate IP addresses or so on. | | The upside of tech is so great it is worth spending IRL to | defend against the downsides | rejectfinite wrote: | I am of the generation and mindset that online abuse is not | real. Straight up. Log out, turn off the screen and watch | Netflix, take a walk and calm down, block the offending | user. It's not real. 
| | >I suspect we need better support for police in prosecuting | such things | | We do see that! But mostly people on Facebook. Here we have | had judgements of people who posted threats on Facebook | because it is tied to your real name. | | And yes, abuse is part of the "fun". Under your system, my | 10-year-old League and CoD chats would have me locked up. | | >I mean a death threat is a death threat. | | Is it? I would find it more concerning if someone on the | street tells me he is going to kill me than a kid on xbox | live. | | NOW there is a difference in systematic stalking and | harassment online if I would get bombarded with DMs and | messages to kys. I don't know how to solve that. But a one-off | comment is NOT equivalent. Then it feels like I'm just old? | At 31? Is it really so serious? | femto113 wrote: | Fear it happening or fear its consequences? Doxxing already | happens all the time, but the main tools are things like | account names or image search; this sort of tool could take it | to a new level. A simple experiment would be to run this same | algorithm against another site (say Twitter or Reddit) and see | if it can reliably pick out the same people's accounts there. | Once anyone on the internet can quickly/easily draw that sort | of connection it would require incredible diligence to avoid | de-anonymization while still maintaining any sort of "real | self" presence on the internet. How much we should fear the | consequences probably depends a lot on how marginalized you are | within your society, but since just revealing your gender is | enough to invite harassment in many forums I'm not optimistic. | CrypticShift wrote: | Ingenious idea. At the very least, this is just about finding | people who write like us, the same way we seek those with similar | tastes (music...) | | How long before large commercial indexers start offering an | efficient (AI based ?) stylometry to agencies and states ? | | wait... 
do you think the NSA is already doing this? | A4ET8a8uTh0 wrote: | They would be silly not to ( apart from creepish profiling of | an entire globe population you also get to potentially identify | bots ). We all have mannerisms that can easily 'betray us' | online. I honestly thought my writing style is more unique, but | as it turns out it is somewhat common. | CrypticShift wrote: | > I honestly thought my writing style is more unique | | You just showed another possible use case for this kind of | tool: "How unique is my writing style ?" | sitkack wrote: | It isn't writing style, but more of phrase selection. If you | lean on the same phrases (n-grams), then you will be very | very close in a high dimensional space. Colloquialisms are | the biggest tell, you should eschew them. | woodruffw wrote: | Stylometry is an old hat technique; you can assume that | intelligence services around the globe regularly apply it. | | (Statistical stylometry is a little newer and more rigorous | than manual stylometry, which essentially involved a human | being's judgement call around the similarity of documents.) | CrypticShift wrote: | What about "deep learning" stylometry ? | woodruffw wrote: | I don't know, but it wouldn't surprise me if someone has | tried to apply ML to stylometry. Statistical stylometry is | already pretty effective, as demonstrated by this site. | weinzierl wrote: | I played a little bit with it and it is baffling how well it | finds accounts of people that know each other in real life. So | it's not only good for finding alternate accounts but could be | used to find peer groups. | [deleted] | the_cat_kittles wrote: | pretty cool- i think there should be a term for two accounts that | have each other as the top most similar account. kinda sad i dont | have one :( | layer8 wrote: | Stylotwins? | philosopher1234 wrote: | We're pretty close me and you -- closer than my actual alts | the_cat_kittles wrote: | hello friend! but... 
id never use an m dash | philosopher1234 wrote: | Well... I would never use a lowercase word after an | exclamation point! | | ...Because I'm on mobile | balls187 wrote: | No alt, and the highest match is 0.36 | | And that accounts last several comments were flagged as dead. | | I'm a native speaker, but my english succcccks. | rcarr wrote: | One way to get around this legitimately would be by posting a lot | of quotes/lyrics/excerpts and the like thus fooling the algorithm | unless it had a way to filter them out | Fnoord wrote: | Cool stuff, thank you for sharing your findings! | | I don't do throwaway. I either post or STFU. I also STFU on | darknet. Its why I found it fun to read/lurk on things like I2P | back when it was new. And I know that on a pseudonymous account | it is only a matter of time until it can be linked to another | pseudonymous account. It would not surprise me if stylometry was | used on Dread Pirate Roberts or the people behind The Pirate Bay | or the people behind Wikileaks (Assange's sockpuppet accounts). | Such can also have been used to verify afterwards instead of | beforehand. Though with TPB since it was on clearweb an advanced | adversary could have used correlation/timing attack to figure who | wrote what. | | I'm having fun times recognizing other Dutch people though their | usage of English language. For example, a distinctive word I see | Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. Its a | red flag the person is native Dutch. I wonder if there are | stylometry tools available for figuring if someone used physical | vs touchscreen keyboard (I used Glider to write this post, | spellchecker unavailable). | | And yes, organizations like secret service and police should use | such tools as well. It is a known tool, why not use it for good? | As with any tool, it can be used for good and evil. On HN this | could be useful for the mod team (AFAIK nowadays only dang) to | find banned people's sockpuppets. 
Cross-community could also be a | fun project: find a HN user's Twitter or Reddit account. And I | hope this method is also used to find Russian trolls on social | media. | ghaff wrote: | Most people greatly underestimate the power of linkage attacks | on anonymity. And it doesn't even take fancy ML. In the context | of healthcare records, I like to trot out this 25 year old | example of an MIT grad student and the then-governor of MA. | | https://ischoolonline.berkeley.edu/blog/anonymous-data/ | dvh wrote: | Make a fundraiser and start doing it for other sites. | costco wrote: | It would be possible for Reddit because Pushshift.io archives | all the comments there and Reddit is still pretty small. I'd | probably need to make things a lot faster. Doing it on a | specific subreddit would be very feasible. I'll think about it | but I don't actually know if I really want to do that because | for instance I've been banned from subreddits before but I | don't want a ban from when I was 12 years old to follow me | around forever because my writing style hasn't changed. | Moderation is the most obvious application of this kind of | software. | rand_user_100 wrote: | > I'll think about it but I don't actually know if I really | want to do that because for instance I've been banned from | subreddits before but I don't want a ban from when I was 12 | years old to follow me around forever | | Insightful that your personal experience and impact on you | personally affects your decision. I invite you to think about | the impact of the products you build in your CS career by | putting yourself in the shoes of other people as well. | | Some products should not be built, even though it's easy to | build them. | DenisM wrote: | How about this for countermeasure: | | As you're typing out a comment the software gives you a list of | accounts you're becoming similar to. That way you can adjust your | writing as you type. 
| bornfreddy wrote: | Sounds great, except there are many different similarity | measures. Which one does the algorithm use? | wizzwizz4 wrote: | Why not all of them? Which metrics are closer would tell you | which aspects of your writing you need to focus on. | kaba0 wrote: | Someone linked it in the thread: | https://github.com/psal/anonymouth | pessimizer wrote: | Forget countermeasures, go covert. Write a comment, have the | comment be rewritten before submission in order to resemble a | targeted account. | Bhurn00985 wrote: | Just a heads up that for everyone who doesn't like to link their | alt accounts, maybe not use this tool to see if it works. | | Unless the author would run this against all HN user accounts, no | need to flag the ones "of interest". | jl6 wrote: | Rebrand it as a soulmate-finder? | tomrod wrote: | Is it weird that my rating is very low compared to alternative | options? I have no alts, but I'm curious how similar others might | write to me. | JKCalhoun wrote: | The asymmetry is interesting. I have no alts but of course it | nonetheless reported accounts similar to mine. | | Running then the most similar person to my account did not put me | in _their_ top 20. | sitkack wrote: | I believe this is the | https://en.wikipedia.org/wiki/Friendship_paradox | throwboi123 wrote: | That's why I always use throwaway :) everywhere. Reddit. HN. | Twitter. Everywhere. I'll spam every site with my throwaways. | | Long live throwaways. | kaba0 wrote: | That's the point of this post, that you are not safe by | throwaways at all, because all of your throwaways can be linked | together purely by your textual style. | silasdavis wrote: | > imagine what a company with millions of dollars and a couple | dozen PhD linguists could do. | | Could they do much better? | aaron695 wrote: | [deleted] | spaniard89277 wrote: | I changed my nickname so my employer can't find me here. I'm not | amused by this. 
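The asymmetry that JKCalhoun, rogual and soneca report does not need an asymmetric score to explain it. Cosine similarity itself is symmetric, but a top-N list is not: your closest match may sit in a crowded region where many other accounts are closer to them than you are. A toy sketch with invented 2-D vectors (the names and numbers are purely illustrative, not data from the site):

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_n(target, vectors, n):
    """Names of the n accounts most similar to `target`, best first."""
    scores = [(cosine(vectors[target], v), name)
              for name, v in vectors.items() if name != target]
    return [name for _, name in sorted(scores, reverse=True)[:n]]


# Hypothetical style vectors: "hub" has two very close neighbours,
# while "loner" is far from everyone but closest to "hub".
vectors = {
    "loner": (1.0, 0.0),
    "hub": (1.0, 1.0),
    "near_hub_1": (1.0, 1.1),
    "near_hub_2": (1.0, 1.2),
}

print(top_n("loner", vectors, 1))  # ['hub']
print(top_n("hub", vectors, 2))    # ['near_hub_1', 'near_hub_2']
```

"hub" is the top match for "loner", yet "loner" does not appear in the top 2 for "hub", even though the score between the two is the same in both directions.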
| bee_rider wrote: | If this basic implementation can catch you, I'd consider it a | friendly reminder that changing your account name is not a very | effective means of adding privacy. | googlryas wrote: | New account, then translate your comments to Spanish and then | back to English using Google translate. | aryc19 wrote: | So what are some good tools to obfuscate style? | setr wrote: | Forget the alternate accounts -- if two users are close in style, | there's a decent chance they should be friends. This is an HN | friendship machine. | AviationAtom wrote: | Now I can find my HN doppelganger | sillysaurusx wrote: | Ha, gruseom shows up for pg, which is dang's old account. A | worthy successor. | | This is a fascinating way to find similar HN users who aren't the | same person. It's a surprisingly great recommendation engine. "If | you like pg, you might also like..." | | Sure, the privacy concerns are valid, but the cat's out of the | boot. Might as well enjoy the benefits. | | montrose is almost definitely pg. Someone who talks about ancient | history, Occam's razor, VCs and startups, uses the phrase "YC | cos" (relatively uncommon), etc. | https://news.ycombinator.com/item?id=17112567 | | Nicely done. One of the best hacks I've seen in a long time. | costco wrote: | > motrose is almost definitely pg. Someone who talks about | ancient history, Occam's razor, VCs and startups, uses the | phrase "YC cos" (relatively uncommon), etc. | https://news.ycombinator.com/item?id=17112567 | | I had this hunch too. It's either pg or someone trying really | hard to be pg. | roughly wrote: | I mean, this is HN - | | > someone trying really hard to be pg | | describes half the site. | asveikau wrote: | > Someone who talks about ancient history, Occam's razor, VCs | and startups, | | I think these are all common topics among HN readers and | commenters. | pyb wrote: | Why would montrose be pg ? The correlation is not that high. 
| Looks like a few people have picked up pg's mannerisms. | VyseofArcadia wrote: | > but the cat's out of the boot | | It's my first time hearing that variant. Usually it's, "the | cat's out of the bag" where I'm from. | | Do you mean boot in the UK sense, what Americans would call the | trunk of a car? Or do you mean a sturdy piece of footwear? | | Obligatory xkcd https://xkcd.com/2390/ | sillysaurusx wrote: | It's a little writing trick I learned from (I think) Orwell. | Any time you're about to use a common metaphor, try to tweak | it. You'll catch readers off guard, which piques their | curiosity. | | It's a fun game, too. I wish I'd used "the cat's out of the | hat," but I didn't think of it till later. | UncleEntity wrote: | Yeah, it's like shooting ducks in a barrel it works so | well. | | Easy to overuse then people just get annoyed though...kind | of like commas, I suppose. | esfandia wrote: | I like mixing metaphors, in this case "the cat's out of the | tube". ("the toothpaste's out of the bag" doesn't work as | well though) | InGoodFaith wrote: | What you are describing is also known as an eggcorn. | | https://en.wikipedia.org/wiki/Eggcorn | operator-name wrote: | That's neeto! | | The 2nd example also loosely falls under the | classification of malaphor. | | https://en.m.wiktionary.org/wiki/malaphor | sillysaurusx wrote: | Thank you! I was trying to find the original essay I | learned it from. I'm now pretty sure it was by Poe, but | all I can remember is the main advice: avoid common | metaphors. | | I vaguely remember one of the metaphors in the essay was | about a chicken coop melting, or something like that. It | was vivid enough to leave a big impression. | ewilden wrote: | I remember this being from Politics and the English | Language (https://www.orwellfoundation.com/the-orwell- | foundation/orwel...): | | " Dying metaphors.
A newly invented metaphor assists | thought by evoking a visual image, while on the other | hand a metaphor which is technically 'dead' (e. g. iron | resolution) has in effect reverted to being an ordinary | word and can generally be used without loss of vividness. | But in between these two classes there is a huge dump of | worn-out metaphors which have lost all evocative power | and are merely used because they save people the trouble | of inventing phrases for themselves." | sillysaurusx wrote: | Thank you so much! That's the one. | | (It's remarkable how often a vague description can yield | an HN comment with an answer from a clever sleuth like | yourself. Much appreciated.) | sdwr wrote: | I love doing this too, it's fun to write. | [deleted] | kevmo314 wrote: | There's someone (michaelmior if you're around!) with a false | positive 0.46 match to me. | | Maybe we could be friends :) | drc500free wrote: | This is a super interesting tool for self reflection. Looking at | the top 10 similar accounts to mine, it gives me an arms-length | view of how other people probably interpret my tone. | | I appear to be a well-educated, over-confident know-it-all. | pavlov wrote: | My #3 match is cstross, and now I'm convinced that my life-long | secret dream of being a successful sci-fi novelist is basically | a matter of typing. (Ideas? Character development? Ruthless | editing? Developing an audience? Having a publisher? What do I | need of those when the Computer told me I'm practically a | genius...) | bee_rider wrote: | I also enjoyed reading one of my style-partner's posts. | | The most noticeable similarity is that we both clearly have | strong opinions about some things, and like to share | information, but also like to be clear about our unknowns or | opinions. So, lots of "sounds likes," "probably," "could be" | and so on. | | The downside is, I guess, this could be seen as a bit weasel- | word-y or indirect. 
| seydor wrote: | we must be a good match | bhaney wrote: | > I appear to be a well-educated, over-confident know-it-all. | | Don't we all? | sdwr wrote: | I hate us insufferable nerds! | closeparen wrote: | That's what we all come to HN for... | interroboink wrote: | This is one reason why I like legal doctrines such as "beyond a | reasonable doubt." Even a 0.9 match in a tool like this could be | a coincidence, if there are millions of users. But that won't | stop people from casually believing "aha it must be an alt | account", based on some anecdata. | | It's so easy for something like this to be turned into a tool for | a witch hunt, targeting innocents. | dsr_ wrote: | I like the way some usernames are only 0.9999999 correlated with | themselves. | | Perhaps 6 or 7 digits is enough? | rcarr wrote: | This is somewhat similar to how they ended up catching the | Unabomber. The FBI were literally at a dead end. They ended up | posting one of his letters/manifestos in the paper, somebody | recognised a turn of phrase the Unabomber used that was unusual | and reported it as possibly being their brother, the FBI investigated | the lead and it led them straight to him. | | Excerpts from wiki: | | > Before the publication of Industrial Society and Its Future, | Kaczynski's brother, David, was encouraged by his wife to follow | up on suspicions that Ted was the Unabomber.[91] David was | dismissive at first, but he took the likelihood more seriously | after reading the manifesto a week after it was published in | September 1995. He searched through old family papers and found | letters dating to the 1970s that Ted had sent to newspapers to | protest the abuses of technology using phrasing similar to that | in the manifesto.[92] | | > In early 1996, an investigator working with Bisceglie contacted | former FBI hostage negotiator and criminal profiler Clinton R. | Van Zandt.
Bisceglie asked him to compare the manifesto to | typewritten copies of handwritten letters David had received from | his brother. Van Zandt's initial analysis determined that there | was better than a 60 percent chance that the same person had | written the manifesto, which had been in public circulation for | half a year. Van Zandt's second analytical team determined a | higher likelihood. He recommended Bisceglie's client contact the | FBI immediately.[96] | | > In February 1996, Bisceglie gave a copy of the 1971 essay | written by Ted Kaczynski to Molly Flynn at the FBI.[87] She | forwarded the essay to the San Francisco-based task force. FBI | profiler James R. Fitzgerald[98][99] recognized similarities in | the writings using linguistic analysis and determined that the | author of the essays and the manifesto was almost certainly the | same person. Combined with facts gleaned from the bombings and | Kaczynski's life, the analysis provided the basis for an | affidavit signed by Terry Turchie, the head of the entire | investigation, in support of the application for a search | warrant.[87] | | https://en.m.wikipedia.org/wiki/Ted_Kaczynski | googlryas wrote: | It was actually his brother. | fbdab103 wrote: | So is the lesson you should have GPT rewrite your manifesto so | as to obscure your personal idioms? | CharlesW wrote: | Or something purpose-built like Anonymouth | (https://github.com/psal/anonymouth), although it seems to be | both unique and dead. | | Also interesting: | | > _Ross Ulbricht aka Dread Pirate Roberts, the mastermind | behind the infamous Silk Road site which served as a black | market for drugs, weapons and fake documents was also well | aware of the potential danger of stylometry being used | against him. At the time of his arrest in a San Francisco | public library, the FBI captured images of his laptop screen | as evidence. 
Guess what he had bookmarked -- "Science of | Stylometry."_ | | https://medium.com/svilenk/the-case-for- | anonymity-12db114f0c... | rejectfinite wrote: | I mean he used a forum account with an email that had his | name in it. | fbdab103 wrote: | That's the problem - it only takes a single slip and it | is recorded forever. Perfect opsec is an impossibly high | bar if you are maintaining an active online presence. | elteto wrote: | Incredible! There was a very active throwaway account here a | while back that I always enjoyed interacting with. I suspected | the person had more than one account and this found one that is | incredibly close, down to the topics. | DrStrangeLoop wrote: | I tried dang's old account (gruseom) expecting to see his dang | account listed. Nothing. Tried dang, sctb (a previous admin) was | listed as closest match. | | I wouldn't rely on these results | | https://stylometry.net/user?username=gruseom | | https://stylometry.net/user?username=dang | pvg wrote: | _I wouldn't rely on these results_ | | You picked a user who posts a massive volume of repeat, | template-y comments and found their former colleague who also | posted piles of repeat, template-y comments, that being part of | both of their jobs. | DrStrangeLoop wrote: | There are a few close matches to dang's style of template-y | comments in the results. Afaik none of the listed accounts | are Daniel. | | I picked dang as he is the figurehead of HN, and didn't want | to inadvertently reveal some other user's identity. | dragonwriter wrote: | > There are a few close matches to dang's style of | template-y comments in the results. | | At least the #1 close match (sctb) was a comoderator with | dang, so they were kind of alts as the official voice of | HN. | woodruffw wrote: | Neat work! | | Out of curiosity: do you filter sentences that begin with '>', | indicating a block quote from another user? That might improve | the accuracy a little here, if you don't already.
| costco wrote: | Yep! | jsnell wrote: | After a few tries on boring accounts, I thought to try the | account of somebody who was notorious for an incident outside of | HN, and had a (deservedly) bad time at HN for a couple of years | before the account went dark. | | And yeah, there's a bunch of high confidence (.6-.8) hits for | that account, and from a quick browse of the comments of the | recently active ones, they look really likely to be alts. Like, | all three that I looked at had comments that made it very clear | it was this person writing pseudonymously. (E.g. writing on their | signature issue, and saying they couldn't go into more detail due | to fear of self-doxxing; or somebody literally saying that the | alt's claims reminded them of the public writings of the | notorious guy years ago). | | Obviously I'm not naming the account, but this functionality | turned out way creepier than I thought the moment I tried it on | the account of somebody who has a reason to disassociate from an | existing public persona, but still wants to participate here. | Animats wrote: | 0.6 isn't much. I have 3 matches above 0.6, and they're not me. | 20 or so over 0.5. | input_sh wrote: | I get one 0.68 match, which... fair enough. It is an account | I've abandoned some years ago, no secrets there. | | No other hits above 0.5, so I guess that either makes me | pretty unique as a commentator or my English is broken in a | unique way. | jsnell wrote: | That's why you manually evaluate the matches. And like I | wrote in that comment, I did that manual eval, and these | clearly are alts of that main account, not spurious. | Narrowing down the pool of accounts you'd need to do this | kind of manual evals for by a factor of 100000 is a pretty | significant change in capabilities. | tqi wrote: | > quick browse of the comments of the recently active ones, | they look really likely to be alts. 
| | Hmm isn't a spot check of comments somewhat tautological, since | that is how the tool identifies alts (rather than something | like IP address or time of day)? If this had been promoted as | "find accounts with similar writing style to yours" would | people immediately assume alts? | margalabargala wrote: | I would presume that OP is referring to the actual content of | the comments. This just does stylometric analysis, which | looks at word choice, but not what the arrangement of the | words _mean_. | | If some accounts are found to be stylometrically similar, and | then a visual inspection also shows them all stating similar | opinions, that latter piece of data is a strong signal. | thesz wrote: | I keep no alternate accounts, but this tool reports best | matches for me that appear to be Slavic or just Russian - and I | am Russian. Best match score in my list is just above 0.5. | There are some clearly alternate accounts on the list, their | match scores with this tool are well above 0.7. | | It is probable that persons of same cultural origin will have | similar writing style and vocabulary. It is also probable that | persons of same cultural origin would have same relationships | with the world as a whole, they would like same things and | dislike other same things. | | So, in my opinion, it is possible that you have found not only | alternate accounts (score above 0.7), but accounts of people | with same cultural origin (ones that are around 0.6). | ricardobayes wrote: | My highest was 0.41 and the person writes nothing like me. I | guess I'm a unique snowflake after all. | jrumbut wrote: | I have a few in the low 0.5's and, honestly, they seem cool | and I want to meet them. | gilleain wrote: | my second highest hit (ie, third in the list) is gwern at | 0.45 who i'm fairly sure is not me. | scarmig wrote: | I was actually just looking at near hits for gwern and | found what's almost definitely a defunct alt for him. 
| gilleain wrote: | Well is certainly NOT me, that's for sure. | | On an unrelated topic, I'm starting a service to write | comments in the style of others to provide plausible | deniability for other alt accounts. Rates negotiable. | vbezhenar wrote: | There're 19 other accounts this tool finds similar to me. | Those are not my accounts. 0.46 - 0.56 are numbers. | bbarnett wrote: | You are fools, one and all! This tool's only purpose, is to | tag people who use it! | | Now they know just who cares about which alternate | accounts. They _know_! | | They freaking know, man! | | You have all fallen for their ploy. Fools! | thesz wrote: | I have no alternate accounts and visited the site out of | curiosity, because I used to worked in the domain like | this. | | What I found was worth visiting the site. Somehow notably | many accounts with (relatively) high similarity to mine's | are sharing at least one of my personal traits. | | Which is fascinating, to me. | | And I think is worth to be noticed by others - what and | how you write can disclose who you are. | TheOtherHobbes wrote: | It knows my IP now. | | (Or does it?) | neodypsis wrote: | It offers no privacy policy, so can't tell. | csa wrote: | Fwiw, and as gp mentioned, > 0.7 seems more likely to be | alt territory. | costco wrote: | I think people are sort of confused at what this tool is | supposed to be which I will concede is partially my fault. | The results of this tool are by themselves not indicative | of having an alternative account. It generates the 20 most | similar users for every single user on the site, regardless | of whether they have an alt or not (there's obviously no | way for me to know that for every single user). In your | case further investigation would reveal that none of those | accounts are yours. | thesz wrote: | It is a fun tool, I can assure you. It is just people | have found use case you haven't foreseen yourself. 
| | I think your tool should have internal embeddings for | each of the users. Also, most probably your tool uses | cosine similarity for a search. | | Thus, I would like to suggest a feature: recognize simple | arithmetic operations over user's embeddings, such as | "thesz - 2 * patio11". It will make things even more fun, | this way we can find users who are like me and much not | like patio11. Even simple additions and subtractions | would suffice. | | (an idea is taken from properties of word2vec embeddings) | | Your tool is thought provoking. What I discovered with it | made me think about my use of language and what other | languages (body, imagery, etc) I use differently because | of who I am. Which made me think about my favorite | underrated superhero Cypher [1] - would his innate | ability to understand languages make him best detective | ever? | | [1] https://en.wikipedia.org/wiki/Cypher_(Marvel_Comics) | | Thank you! | phreeza wrote: | MD5 of the username is 9abc27e93b7e3c04b7c599017c1cfe5f ? The | top one seems an odd one out in that case? | Aachen wrote: | Usernames aren't random enough to be safe as a simple MD5. | Perhaps with a strong bcrypt, but similar to PIN codes, it | might be better to give partial information like "is the | second character an ...", assuming nobody else made similar | statements. Or give the first ~two hex characters of the | hash, so that it would match 1/(16^2)th of the usernames. I'm | sure there's also a clever way for a zero-knowledge proof | here, probably something with Diffie-Hellman using the name | as your random integer or something, but I'm too sick to | think about this stuff right now. Privately sharing data | publicly is hard. | lzooz wrote: | Good point - I've been running john on that md5 for a | couple minutes :) | wizzwizz4 wrote: | Why use John? Just run down the list of Hacker News | usernames; it'll take less time.
(Or, better still, | don't; just because the privacy's theoretically | compromised doesn't mean we have to exploit that.) | lzooz wrote: | I don't think there's a public list of all HN usernames | is there? | | Found this, it includes 250k usernames, but it's not | there. https://www.kaggle.com/datasets/hacker- | news/hacker-news-corp... | [deleted] | meta2023 wrote: | The username in question isn't in this dataset but maybe | it was created in the past 10 days, as the max(timestamp) | is Nov 16th, 2022. | | https://console.cloud.google.com/marketplace/details/y-co | mbi... | ahmedalsudani wrote: | Another problem is that it's a small set. If you had a list | of all HN users, you could compute md5 for all of them in | seconds. | [deleted] | kcarter80 wrote: | Could you elaborate on why it's obvious why you won't name the | account? | notduncansmith wrote: | Maybe to avoid attracting any extra attention to this user? | Also, as someone who's read HN for a few years, it only took | me 2 guesses to find an account that the above comment | describes (and not necessarily the same person). | [deleted] | sillysaurusx wrote: | It was a classy move by jsnell, too. Thank you. | | (I don't know who the comment is talking about, which is | how it should be. There's no need to blow someone's cover | in a highly visible way. Even if they were satan, they'd | still be welcome on HN as long as they're writing | substantive, interesting comments that follow the | guidelines.) | Normal_gaussian wrote: | Such quality comments would track with most thorough | Satan representations. | Aachen wrote: | They obviously don't want it to be known, seeing as they've | got alts to post under and avoid going into too much detail. | Being able to go out and do your own research is different | than posting the information open for everyone to see at a | glance. 
| | I would say it's obvious why one might respect that wish (do | unto others...), but I'm also aware that my and my culture's | sense of privacy goes further than many others'. | tbrownaw wrote: | > _but this functionality turned out way creepier than I | thought the moment I tried it_ | | Hopefully this raised awareness means that people who actually | need anonymity will be more likely to know to take precautions. | kaba0 wrote: | Genuinely asking, what way is there to combat this? Is there | a tool that takes out stylistic elements of your comment? | thedragonline wrote: | I wonder if gpt3 has a use case here? | marbu wrote: | One way would be to run such tool before posting and then | based on the results, tweak the post and repeat until the | similarities are not statistically significant. Or instead | of tweaking, start posting under a new throwaway account. | But this won't save you when some new way to analyze style | appears in the future. Moreover there are other types of | meta data which can be taken into account to narrow down | the search space a bit such as timestamps. And obviously | more you write, harder it is to control these things. | paulgb wrote: | The site mentions a service called Quillbot which | apparently does just that. https://stylometry.net/avoid | birdyrooster wrote: | UncleEntity wrote: | You know everyone going to put your username in that tool | after a rant like that. | | If ever there were a good use for a throwaway account I'm | thinking this is it... | irrational wrote: | .6 is high confidence? I did my own username, wondering what it | would return, since I know I don't have any alt accounts. The | top results are in the .6-.7 range. If they aren't alt | accounts, is it just coincidence that we have similar writing | styles? | bee_rider wrote: | I think so. | | A funny thought -- my "matches" cap out at around .56. 
Having | false positives* in a tool like this might feel like a "bad | result" but actually I think it just means that if someone | were running this sort of tool across the whole internet, I'd | be relatively easy to correlate, while your identity would be | intermingled with your .6-.7 partners. | | *actually they aren't really even false positives because the | tool doesn't promise to detect alts in the first place, just | find similar styles. | joxel wrote: | ColinWright is Dang? | | Woah | McDyver wrote: | Would this work for Fernando Pessoa and all his heteronyms? :) | jll29 wrote: | The method used, i.e. to calculate the cosine of the two authors' | word vectors, is poorly suited for stylometric analysis because | it is based on a poster's lexicon and the word frequencies of | each word, but ignoring stylistically relevant factors like word | order. | | Also, the cosine of the vectors of word frequencies conflates | author-specific vocabulary and topics; in other words, my account | is grouped (with >51% similarity, according to the demo) with | someone probably because we wrote about similar things. A strong | stylometric matcher ought to be robust against topic shifts (our | personal writing style is what stays constant when we move from | writing about one topic to writing about another topic, just like | our personality is what stays constant about our behavior over | time - of course styles do change, but the premise then has to be | that such changes happen very slowly). | | Stylometrics/authorship identification is interesting and has led | to some surprising findings, e.g. in forensic linguistics | (Malcolm Coulthard wrote several good books about the topic). | | This paper lists some other features that could be used and | compares a bunch of techniques: | https://research.ijcaonline.org/volume86/number12/pxc3893384... 
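To make jll29's point about word order concrete, here is a minimal Python sketch of cosine similarity over raw word counts. It is an illustration only, not the site's actual code: the helper names `word_counts` and `cosine` are hypothetical, and splitting on whitespace is an assumed tokenization.

```python
import math
from collections import Counter


def word_counts(text):
    """Hypothetical helper: lowercase the text and count whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Same words, different order: a bag-of-words model cannot tell these apart.
same_words = cosine(word_counts("the dog bites the man"),
                    word_counts("the man bites the dog"))

# Different topic, but the shared function word "the" still contributes.
different_topic = cosine(word_counts("the dog bites the man"),
                         word_counts("the compiler rejects the program"))

print(same_words, different_topic)
```

The first pair uses identical words in a different order, so it scores 1.0 up to floating-point rounding. The second pair shares only "the" and still scores 4/7, roughly 0.57, which suggests how common function words alone can produce mid-range similarity between unrelated texts.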
| agumonkey wrote: | Oh god, that thing starts with direct focus on the search field, | opening it showed a bunch of old nicknames, I thought it was the | result of some study. | rand_user_100 wrote: | On one hand, thank you for showing us all how easy it is to make | something like this. No doubt organizations with more resources | already have more sophisticated systems in the same vein. | | On the other hand, can we agree that this product is unethical? | | In many cases, when a person uses an alt, it is a direct and | strong signal that they do not wish their other posts to be | associated. | | So this product is circumventing the explicit will of the person, | and making it available to anyone with zero effort, i.e. there is | no barrier to getting this info. | | I met someone about 10 years ago who said they built this at a | university. And their argument also was "actually this enhances | privacy because it lets you know something something something". | And yet their research grants were coming from one source only. | | It _can_ be used for good, but most often it won't. | A4ET8a8uTh0 wrote: | << On the other hand, can we agree that this product is | unethical? | | It does create a high level of discomfort, because it | illustrates well what privacy advocates try talking about to | the population at large, but all that said.. how is it any | different from regular scraping and analyzing it any other way? | | This is a real question. | rand_user_100 wrote: | It's different because you're removing all barriers to access | and making it easy and convenient to stalk/dox people. | | Imagine you get the urge to track someone, but in order to do | that you have to spend a week writing some new software. | That's a barrier. And because of it you may change your mind | because it's a lot of work with little payoff. | | But if that info is just one click away, it's a whole | different ballgame.
| [deleted] | dragonwriter wrote: | > On the other hand, can we agree that this product is | unethical? | | No. | gus_massa wrote: | It would be nice to make the names clickable. | | I don't think the list of pg alternate account is accurate. I | checked a few. They have many oneliners that is typical of pg, | but the topics and style don't look similar. | | I searched a few more and got better results. :) | | I searched myself (that I know that I have no alternate | accounts). I recognize a few users that are interested in similar | topics, and I discuss/upvote them many times. But I didn't | recognize most of the user of the list. | costco wrote: | > I searched myself (that I know that I have no alternate | accounts). I recognize a few users that are interested in | similar topics, and I discuss/upvote them many times. But I | didn't recognize most of the user of the list. | | It's based purely off frequency of the 200 most common English | 1 word phrases, 2 word phrases, 3 word phrases, 1 character | sequences, 2 character sequences, and 3 character sequences. | Topic does not really have anything to do with it. If I had | more time I probably would've done a smarter model that | accounted for things like that. | gus_massa wrote: | One is also a mathematician. It's trivial that we overuse | some technical words even if it's unnecessary. | | Another is form Argentina, so I guess the native language | leaks, for example using words derived from latin that are | not idiomatic. | | And there are a few more, that is a honor to be "confused" | with, but I have no clue why. | gavinray wrote: | I've complained a lot about Haskell and now it thinks I like | Haskell =( | | Needs sentiment analysis IMO, otherwise you'll get "Here's a | bunch of people who are JUST LIKE YOU", except they use a similar | grammar style but hold opposite opinions on the same nouns. | ahmedalsudani wrote: | Serves you right for disparaging The One True Language! 
| | Ok, fine, we'll present Idris with a fig leaf. | layer8 wrote: | It just thinks you engage a lot with Haskell. These are people | with who you have something to talk about. :) | chronogram wrote: | Well done, it found my ancient old account. | [deleted] | scotty79 wrote: | Funny thing would be to find most unique user account | stylistically. | | Which user has lowest best match? | | Mine is 0.58 so I'm really not that unique. | ggerganov wrote: | I really liked the informative and straight-to-the-point about | page - describing how the algorithm works in a way that is easy | to understand. All the important details are summarised there. | Well done! | | Edit: From the "How to avoid .." page, there is the following | sentence: | | > Also, most authorship identification algorithms have poor | accuracy when working with small amounts of words. This means the | optimal strategy would be discarding an account either after | every comment or after a small number of comments. Unfortunately, | this is against HN rules and may result in a ban. | | Can you clarify what this means and why it would result in a ban? | costco wrote: | > Can you clarify what this means and why it would result in a | ban? | | I have seen dang respond to users multiple times asking them to | stop making new accounts especially but not always if it's to | avoid rate limiting. I don't know if there's an official policy | but it's definitely something I recall. | krisoft wrote: | > Can you clarify what this means | | Imagine that for every new comment you want to post you would | create a brand new account which you would use precisely once | and never again. Then the stylometry would have just a few | words and wouldn't have enough corpus to get a reliable | signature. If a lot of people does this it would be hard to | figure out which account belongs with which human. ( Of course | if you alone do this, your messages will stick out like a sore | thumb. 
See xkcd 1105 ) | | > why it would result in a ban? | | Because this practice is especially discouraged in the | guidelines: "please don't create accounts routinely. HN is a | community--users should have an identity that others can relate | to." | stupendous_luck wrote: | At the same time, HN doesn't let you delete comments. | | Maybe with some GDPR magic. | krisoft wrote: | Not sure what is your point, or how does that connect with | my comment. Care to elaborate? | stupendous_luck wrote: | Your comment quotes an HN guideline, and my point relates | to it. Some users may feel the need to create throwaway | accounts in order to post comments that in an alternative | reality they could post under their primary account and | later delete if desired. It may not stop a scrupulous | collector of data, but such a scenario may not be the | object of their worry. | | Drawing this into the logical conclusion, a user may opt | to always post under a throwaway account, to avoid any | possible tainting associated with a primary account. | jaredsohn wrote: | Amusingly can't run it on the author since not enough comments | joshstrange wrote: | Very interesting, .59 is my lowest, .64 is my highest match, none | of these accounts are one of my alts. Though to be fair the | handful of times I've used a throwaway I used it for a single | comment so I didn't give it much to go off. | sedatk wrote: | I have no alternate accounts, and all my matches are below 0.4 | for whatever it's worth. | SevenNation wrote: | > ... This site works primarily by analyzing for each user the | frequencies of the most common words and phrases in the English | language. Accordingly, the easiest way to avoid being identified | is to simply use different words than you ordinarily would when | writing. More sophisticated models than the one I made can use | punctuation, comma usage, and capitalization to identify you so | try alternating those as well. 
Services like Quillbot can help | you with this, but depending on your circumstances you may not want | to send your writings to a third-party service. | | HN offers many other threads which could be tied together, | including: | | - time of posting | | - ratio of replies to top-level comments | | - comments being mainly upvoted or downvoted | | - sentiment (mostly angry, dismissive, questioning, etc.) | | - most common topics (keyword analysis of post being replied to) | | - ratio of new posting to post replies | | - first-to-comment on a post | | - lone comment on a post | | - etc... | | It seems very likely that sooner or later every pseudonym for | posting content will get discovered and linked. The lesson here | is don't post anything that would cause you undue shame or harm | if linked directly to your legal name. ___________________________________________________________________ (page generated 2022-11-26 23:00 UTC)