[HN Gopher] PyWhat: Identify Anything ___________________________________________________________________ PyWhat: Identify Anything Author : trueduke Score : 183 points Date : 2021-06-16 07:54 UTC (15 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | cosmic_quanta wrote: | In the same vague theme of "I don't know what I'm dealing with" : | https://github.com/ajalt/fuckitpy | kilnr wrote: | Another one sort of related is hachoir, and specifically the | hachoir-metadata script: https://github.com/vstinner/hachoir | 0-_-0 wrote: | I like the Versioning section: | | _The web devs tell me that fuckit 's versioning scheme is | confusing, and that I should use "Semitic Versioning" instead. | So starting with fuckit version h.g., package versions will use | Hebrew Numerals._ | antongribok wrote: | I can't decide what I'm more impressed with: | | The 110% code coverage, the downloads per month, or the | license. | bee_rider wrote: | I'm not sure if it was intentional or not, but I love that | the Hebrew characters that they found look visually similar | to Nan. | dec0dedab0de wrote: | At first I thought this was going to be like google lens. It's | instead a way to probabilistically Identify things in strings. I | have wished for this to exist, and made my own dumbed down | version of it before. This could be very useful for less fragile | screen scraping. | acidbaseextract wrote: | Some more great probabilistic python libraries: | | https://github.com/datamade/usaddress - "usaddress is a Python | library for parsing unstructured address strings into address | components, using advanced NLP methods." | | https://github.com/datamade/probablepeople - "probablepeople is a | python library for parsing unstructured romanized name or company | strings into components, using advanced NLP methods." | nerdponx wrote: | I have used and benefited tremendously from both of these | libraries. 
While the methods are sound, the training data they | used is not that comprehensive. You will probably want to apply | some heuristic cleanup before and after processing. Or if your | organization has a lot of time and money, add additional | training data. | cge wrote: | A note on the usaddress library, since I was surprised when it | failed spectacularly as I played with it: the 'us' in the | name appears to refer to the US, not 'unstructured'. There's no | note of this in the readme, though there is a small US flag | emoji in the GitHub about string. | ssivark wrote: | Nice! In the same spirit, here's an interesting talk on using | Gen.jl (a probabilistic programming library/framework) for | cleaning messy data in tables: https://youtu.be/vUxrtqY84AM | ok123456 wrote: | https://github.com/chardet/chardet - Detects the most likely | encoding of a raw byte string. | lapp0 wrote: | Why would I need this when I already have a full Tome of Identify | with 50 charges? | nknealk wrote: | Tome of identify only holds 20 charges | AbraKdabra wrote: | I'm pretty sure he's playing the Project Diablo II mod. | saas_sam wrote: | PyWhat only uses one inventory slot vs. 2 for Tome. That's one | extra SoJ! | lettergram wrote: | We built a similar tool, utilizing a CNN. It works on structured | (and unstructured) data and provides additional info. | | https://github.com/capitalone/DataProfiler | | The cool part is that you can "extend" the internal named-entity recognition | model by refitting with the new data. | | Out of the box, the DataProfiler detects something like 18 entities | including most of the PII data. | [deleted] | gigatexal wrote: | There really is a Python module for everything. | cecilpl2 wrote: | Cool, but it seems like 80% of the results in your example demos | are Youtube video IDs. | Mogzol wrote: | I find it kind of funny that they would choose to show those as | demos when it's obvious that most of them really aren't Youtube | video IDs.
Like "Accept-Lang" is pretty obviously not actually | a video ID, even if it matches the [A-Za-z0-9_-]{11} pattern | and technically could be a valid ID. | | On the other hand, I don't know how you would actually verify | whether an 11-character string is or isn't a Youtube ID (short | of querying Youtube itself), so I suppose it's nice that | potential IDs are shown, just seems they have a very high | chance of being false positives. | meowface wrote: | You can reduce false positives by trying to identify | base64-seeming strings that are 11 characters long. Above a | certain amount of entropy and uppercase/lowercase/digit | distribution, etc. You might risk false negatives, but | different flags for different levels of sensitivity could | help with that. | vitus wrote: | I'm admittedly not impressed by the pcap processing. | | It identifies a bunch of fragments of HTTP headers as "YouTube | Video ID". | | Meanwhile, I can get the same info and more by running | $ strings FollowTheLeader.pcap *]?> GET / | HTTP/1.1 Host: 10.0.2.5 User-Agent: Mozilla/5.0 | (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0 | Accept: | text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 | Accept-Language: en-US,en;q=0.5 Accept-Encoding: gzip, | deflate Connection: keep-alive Upgrade-Insecure- | Requests: 1 Pragma: no-cache Cache-Control: no- | cache HTTP/1.0 200 OK Server: SimpleHTTP/0.6 | Python/3.7.3rc1 Date: Sun, 14 Jul 2019 02:42:13 GMT | Content-type: text/html Content-Length: 105 Last- | Modified: Sun, 14 Jul 2019 02:41:10 GMT <h1>My Flag Web | Page</h1> <p>Hi there! Have a flag!</p> <p>Here | is your flag: ctfa{terrific_traffic}</p> ___________________________________________________________________ (page generated 2021-06-16 23:00 UTC)
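The false-positive filter meowface suggests in the thread can be sketched roughly as follows. This is a minimal illustration, not PyWhat's actual logic: the function names (`shannon_entropy`, `looks_like_random_id`) and the thresholds (`min_entropy=2.5`, `min_classes=3`) are assumptions chosen for the example, and the pattern is the `[A-Za-z0-9_-]{11}` shape quoted by Mogzol. As meowface notes, a strict setting risks false negatives, since a real ID may happen to contain no digits.

```python
import math
import re
from collections import Counter

# Shape of a candidate ID as quoted in the thread: 11 URL-safe base64 characters.
CANDIDATE = re.compile(r"[A-Za-z0-9_-]{11}")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the characters in text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_random_id(text: str, min_entropy: float = 2.5, min_classes: int = 3) -> bool:
    """Heuristic: keep an 11-character candidate only if it looks random.

    min_entropy and min_classes are hypothetical tuning knobs; raising them
    lowers false positives at the cost of more false negatives.
    """
    if not CANDIDATE.fullmatch(text):
        return False
    # Count how many of the three character classes (lower, upper, digit) appear.
    classes = sum([
        any(ch.islower() for ch in text),
        any(ch.isupper() for ch in text),
        any(ch.isdigit() for ch in text),
    ])
    return classes >= min_classes and shannon_entropy(text) >= min_entropy


if __name__ == "__main__":
    # "Accept-Lang" is the header fragment from the thread; the second string
    # is just a sample value with the right shape.
    for sample in ["Accept-Lang", "dQw4w9WgXcQ"]:
        print(sample, looks_like_random_id(sample))
```

Exposing the two thresholds as parameters mirrors the "different flags for different levels of sensitivity" idea: a permissive mode could set `min_classes=2`, which would let "Accept-Lang" through again.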