[HN Gopher] PyWhat: Identify Anything
       ___________________________________________________________________
        
       PyWhat: Identify Anything
        
       Author : trueduke
       Score  : 183 points
       Date   : 2021-06-16 07:54 UTC (15 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | cosmic_quanta wrote:
       | In the same vague theme of "I don't know what I'm dealing with" :
       | https://github.com/ajalt/fuckitpy
        
         | kilnr wrote:
         | Another one sort of related is hachoir, and specifically the
         | hachoir-metadata script: https://github.com/vstinner/hachoir
        
         | 0-_-0 wrote:
         | I like the Versioning section:
         | 
         |  _The web devs tell me that fuckit's versioning scheme is
         | confusing, and that I should use "Semitic Versioning" instead.
         | So starting with fuckit version h.g., package versions will use
         | Hebrew Numerals._
        
         | antongribok wrote:
         | I can't decide what I'm more impressed with:
         | 
         | The 110% code coverage, the downloads per month, or the
         | license.
        
           | bee_rider wrote:
           | I'm not sure if it was intentional or not, but I love that
           | the Hebrew characters that they found look visually similar
           | to Nan.
        
       | dec0dedab0de wrote:
       | At first I thought this was going to be like Google Lens. It's
       | instead a way to probabilistically identify things in strings. I
       | have wished for this to exist, and made my own dumbed down
       | version of it before. This could be very useful for less fragile
       | screen scraping.
        
       | acidbaseextract wrote:
       | Some more great probabilistic python libraries:
       | 
       | https://github.com/datamade/usaddress - "usaddress is a Python
       | library for parsing unstructured address strings into address
       | components, using advanced NLP methods."
       | 
       | https://github.com/datamade/probablepeople - "probablepeople is a
       | python library for parsing unstructured romanized name or company
       | strings into components, using advanced NLP methods."
        
         | nerdponx wrote:
         | I have used and benefited tremendously from both of these
         | libraries. While the methods are sound, the training data they
         | used is not that comprehensive. You will probably want to apply
         | some heuristic clean up before and after processing. Or if your
         | organization has a lot of time and money, add additional
         | training data.
        
         | cge wrote:
         | A note on the usaddress library, since I was surprised when it
         | failed spectacularly as I played with it: the 'us' in the
         | name appears to refer to the US, not 'unstructured'. There's no
         | note of this in the readme, though there is a small US flag
         | emoji in the GitHub "About" string.
        
         | ssivark wrote:
         | Nice! In the same spirit, here's an interesting talk on using
         | Gen.jl (a probabilistic programming library/framework) for
         | cleaning messy data in tables: https://youtu.be/vUxrtqY84AM
        
         | ok123456 wrote:
         | https://github.com/chardet/chardet - Detects the most likely
         | encoding of a raw byte string.
        
       | lapp0 wrote:
       | Why would I need this when I already have a full Tome of Identify
       | with 50 charges?
        
         | nknealk wrote:
         | Tome of identify only holds 20 charges
        
           | AbraKdabra wrote:
           | I'm pretty sure he's playing the Project Diablo II mod.
        
         | saas_sam wrote:
         | PyWhat only uses one inventory slot vs. 2 for Tome. That's one
         | extra SoJ!
        
       | lettergram wrote:
       | We built a similar tool, utilizing a CNN. It works on structured
       | (and unstructured) data and provides additional info.
       | 
       | https://github.com/capitalone/DataProfiler
       | 
       | The cool part is that you can "extend" the internal named-entity
       | recognition model by refitting it with new data.
       | 
       | Out of the box, the DataProfiler handles something like 18
       | entities, including most of the PII data.
        
         | [deleted]
        
       | gigatexal wrote:
       | There really is a Python module for everything.
        
       | cecilpl2 wrote:
       | Cool, but it seems like 80% of the results in your example demos
       | are Youtube video IDs.
        
         | Mogzol wrote:
         | I find it kind of funny that they would choose to show those as
         | demos when it's obvious that most of them really aren't Youtube
         | video IDs. Like "Accept-Lang" is pretty obviously not actually
         | a video ID, even if it matches the [A-Za-z0-9_-]{11} pattern
         | and technically could be a valid ID.
         | 
         | On the other hand, I don't know how you would actually verify
         | whether an 11-character string is or isn't a Youtube ID (short
         | of querying Youtube itself), so I suppose it's nice that
         | potential IDs are shown, just seems they have a very high
         | chance of being false positives.
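
A minimal sketch of why "Accept-Lang" gets flagged, assuming the check is nothing more than the 11-character pattern quoted in the comment above. The helper name `looks_like_youtube_id` is hypothetical and is not PyWhat's API; the sample strings are for illustration only.

```python
import re

# Pattern quoted in the comment above: exactly 11 characters drawn from
# the URL-safe base64 alphabet. This is an assumption about the check,
# not a copy of PyWhat's own regex.
YOUTUBE_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def looks_like_youtube_id(candidate: str) -> bool:
    """Return True when the whole string matches the 11-character pattern."""
    return YOUTUBE_ID_RE.fullmatch(candidate) is not None


if __name__ == "__main__":
    # "Accept-Lang" is a fragment of an HTTP header name, yet it matches.
    for text in ("Accept-Lang", "dQw4w9WgXcQ", "Host"):
        print(text, looks_like_youtube_id(text))
```

A pattern this permissive accepts any 11-character run of letters, digits, underscores, and hyphens, which is why header fragments show up as candidates.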
        
           | meowface wrote:
           | You can reduce false positives by trying to identify
           | base64-seeming strings that are 11 characters long: ones
           | above a certain amount of entropy, with a plausible
           | uppercase/lowercase/digit distribution, etc. You might risk
           | false negatives, but different flags for different levels of
           | sensitivity could help with that.
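
The heuristic suggested above could be sketched roughly as follows. The function names, the entropy threshold of 3.0 bits per character, and the `min_classes` knob are all assumptions chosen for illustration, not tuned values and not part of PyWhat.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def plausible_random_id(text: str, min_entropy: float = 3.0,
                        min_classes: int = 3) -> bool:
    """Hypothetical filter: 11 characters, enough distinct character
    classes (lowercase, uppercase, digit), and entropy above a threshold."""
    if len(text) != 11:
        return False
    classes = sum((
        any(c.islower() for c in text),
        any(c.isupper() for c in text),
        any(c.isdigit() for c in text),
    ))
    return classes >= min_classes and shannon_entropy(text) >= min_entropy


if __name__ == "__main__":
    for text in ("Accept-Lang", "dQw4w9WgXcQ"):
        print(text, plausible_random_id(text),
              plausible_random_id(text, min_classes=2))
```

Lowering `min_classes` plays the role of the sensitivity flag mentioned above: the stricter setting rejects "Accept-Lang" because it has no digits, but it would also reject a genuine ID that happens to contain none.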
        
       | vitus wrote:
       | I'm admittedly not impressed by the pcap processing.
       | 
       | It identifies a bunch of fragments of HTTP headers as "YouTube
       | Video ID".
       | 
       | Meanwhile, I can get the same info and more by running
       | 
       |     $ strings FollowTheLeader.pcap
       |     *]?>
       |     GET / HTTP/1.1
       |     Host: 10.0.2.5
       |     User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
       |     Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
       |     Accept-Language: en-US,en;q=0.5
       |     Accept-Encoding: gzip, deflate
       |     Connection: keep-alive
       |     Upgrade-Insecure-Requests: 1
       |     Pragma: no-cache
       |     Cache-Control: no-cache
       |     HTTP/1.0 200 OK
       |     Server: SimpleHTTP/0.6 Python/3.7.3rc1
       |     Date: Sun, 14 Jul 2019 02:42:13 GMT
       |     Content-type: text/html
       |     Content-Length: 105
       |     Last-Modified: Sun, 14 Jul 2019 02:41:10 GMT
       |     <h1>My Flag Web Page</h1>
       |     <p>Hi there! Have a flag!</p>
       |     <p>Here is your flag: ctfa{terrific_traffic}</p>
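
For readers unfamiliar with `strings`, a rough Python approximation of what it does is sketched here: it pulls out runs of printable ASCII of at least four bytes, which is the tool's default minimum. The function name `extract_strings` and the sample bytes are hypothetical, and the real tool differs in details (for example, it can treat tabs as printable and supports other encodings).

```python
import re
from typing import List


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII (0x20-0x7e) at least min_length long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Made-up bytes standing in for a capture file; not taken from the
    # pcap discussed above.
    sample = b"\x00\x01GET / HTTP/1.1\r\nHost: 10.0.2.5\r\n\x00\xff\x10ab\x00"
    for line in extract_strings(sample):
        print(line)
```

Because HTTP/1.x headers travel as plain ASCII, this kind of extraction surfaces them directly from an unencrypted capture without any protocol parsing.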
        
       ___________________________________________________________________
       (page generated 2021-06-16 23:00 UTC)