[HN Gopher] Entity Resolution: The most common data science challenge
       ___________________________________________________________________
        
       Entity Resolution: The most common data science challenge
        
       Author : jonpon
       Score  : 56 points
       Date   : 2022-06-01 17:45 UTC (5 hours ago)
        
 (HTM) web link (docs.magniv.io)
 (TXT) w3m dump (docs.magniv.io)
        
       | visarga wrote:
       | I was expecting the use of sentence-transformers (sbert.net). If
       | you have a long list of entities you could use an approximate
       | similarity search library such as Annoy. The authors store the
       | embeddings in a database and decode JSON for each comparison,
       | which is very inefficient in my opinion. At least load the
       | whole table of embeddings into a np.array from the start;
       | np.dot is plenty fast if your list is not huge.
       | 
       | The problem is still not solved. Having a list of the most
       | similar entities does not tell you which are similar and which
       | are just related. You need a classifier. For that you can label a
       | few hundred pairs of positive and negative examples and use the
       | same sbert.net to finetune a transformer. The authors use the
       | easier route of thresholding cosine similarity score at 0.8, but
       | this threshold might not work for your case.
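A minimal sketch of the approach described above: keep the whole embedding table in one NumPy array and score a query against every entity with a single dot product, then threshold at 0.8 as in the article. The shapes and data here are purely illustrative (random vectors stand in for real sentence-transformer embeddings):

```python
import numpy as np

# Illustrative stand-in for precomputed sentence-transformer embeddings:
# one row per known entity. Real embeddings would come from a model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)
query = rng.normal(size=384).astype(np.float32)

# Normalize once so that a plain dot product equals cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# One matrix-vector product scores the query against every entity.
scores = embeddings @ query

# Thresholding at 0.8, as in the article; with random vectors this will
# match nothing, and in practice the threshold is data-dependent.
candidate_ids = np.where(scores >= 0.8)[0]
top10 = np.argsort(scores)[::-1][:10]
```

As the comment notes, the threshold only produces candidates; deciding which candidates are true matches still needs a classifier.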
        
       | fastaguy88 wrote:
       | Sorry to post on a topic I know nothing about.
       | 
       | To me, this looks very similar to local sequence similarity
       | search (e.g. BLAST), where there are very rapid methods that use
       | tuple-lookup and banded alignment to quickly identify "homologs"
       | (the same entity). The nice thing about similarity searching
       | algorithms is that they give you a very accurate probability of
       | whether two strings are "homologous" (belong to the same entity).
       | Perhaps I have the scale wrong, but it is routine to look for
       | thousands of queries (new entities) among hundreds of millions of
       | sequences (known entities) in an hour or so (and sequences are
       | typically an order of magnitude longer than entity names). The
       | problem is embarrassingly parallel, and very efficient algorithms
       | are available for entities that are usually very similar.
        
       | smeeth wrote:
       | Danger awaits all ye who enter this tutorial and have large
       | datasets.
       | 
       | The tutorial is fun marketing material and all, but it's FAR
       | too slow to be used on anything at scale. Please, for your
       | sanity, don't treat this as anything other than a fun toy
       | example.
       | 
       | ER wants to be an O(n^2) problem and you have to fight very
       | hard for it not to turn into one. Most people doing this at scale
       | are following basically the same playbook:
       | 
       | 1) Use a very complicated speed optimized non-ml algorithm to
       | find groups of entities that are highly likely to be the same,
       | usually based on string similarity, hashing, or extremely
       | complicated heuristics. This process is called blocking or
       | filtering.
       | 
       | 2) Use fancy ML to determine matches within these blocks.
       | 
       | If you try to combine 1 & 2, skip 1, or try to do 1 with ML on
       | large datasets you are guaranteed to have a bad time. The
       | difference between mediocre and amazing ER is how well you are
       | doing blocking/filtering.
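The two-stage playbook above can be sketched in a few lines. This is a toy illustration, not a production recipe: the blocking key (first four letters) is deliberately crude, and `SequenceMatcher` stands in for the "fancy ML" matcher of step 2:

```python
import itertools
from collections import defaultdict
from difflib import SequenceMatcher

names = ["Acme Corp", "ACME Corporation", "Acme Inc",
         "Globex", "Globex LLC", "Initech", "Initech Ltd"]

# Step 1 (blocking): a deliberately crude non-ML key -- the first four
# letters, lowercased. Real systems use far more careful heuristics.
blocks = defaultdict(list)
for name in names:
    key = "".join(c for c in name.lower() if c.isalpha())[:4]
    blocks[key].append(name)

# Step 2 (matching): only compare pairs *within* a block.
# Comparisons drop from C(7,2) = 21 to 5 within-block pairs here.
matches = []
for block in blocks.values():
    for a, b in itertools.combinations(block, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() > 0.6:
            matches.append((a, b))
```

The point of the structure is that step 1 bounds the work step 2 ever sees; how good that bound is depends entirely on the blocking key.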
        
         | jandrewrogers wrote:
         | So much this. I work on spatiotemporal entity resolution and it
         | requires extreme care and cleverness to not turn it into a
         | proverbial "boiling the ocean" compute problem for all but the
         | most trivial cases. The implied state that needs to be managed
         | alone is often effectively intractable. At one time,
         | supercomputers were used for large-scale entity resolution
         | problems; they may still be.
        
         | pfisherman wrote:
         | Great advice! What is your rough guesstimate - in big O - of a
         | general least upper bound of what you can do with ER?
        
           | smeeth wrote:
           | Caveat: just my personal hunch.
           | 
           | My suspicion is O(kn), where k is both constant and rather
           | large. This would be highly reliant on an extremely good
           | blocking algorithm. Maybe trending towards O(n log n)? I don't
           | know. It's really tough.
           | 
           | The blocking algorithm for this hypothetical solution would
           | most likely be an approximate hashing algorithm sort of
           | resembling a bloom filter. The input data would have to be
           | amenable to hashing, like a single string field. Add multiple
           | fields and shit just got complicated fast. Look up q-gram
           | blocking if you want to read a sort of simple example of a
           | blocking algorithm like this.
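A bare-bones sketch of the q-gram blocking idea mentioned above: decompose each name into overlapping character q-grams, use each gram as a block key, and treat any two records sharing a block as a candidate pair. The padding character and q=3 are illustrative choices:

```python
from collections import defaultdict
from itertools import combinations

def qgrams(s, q=3):
    # Pad with a sentinel so prefixes and suffixes also form grams.
    s = f"#{s.lower()}#"
    return {s[i:i + q] for i in range(len(s) - q + 1)}

names = ["jonathan smith", "jonathon smith", "maria garcia"]

# Each q-gram is a block key; a record joins every block for its grams.
blocks = defaultdict(set)
for name in names:
    for g in qgrams(name):
        blocks[g].add(name)

# Candidate pairs = pairs co-occurring in at least one block.
# The two near-duplicate "smith" records share many grams and become a
# candidate pair; "maria garcia" shares none and is never compared.
candidates = set()
for members in blocks.values():
    for pair in combinations(sorted(members), 2):
        candidates.add(pair)
```

Real q-gram blocking schemes also prune very common grams (the "Smith" problem), since an oversized block defeats the purpose.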
        
             | benmanns wrote:
             | Thinking about this, I think you would end up with
             | something like O(s^2*b) where s is the size of your biggest
             | block (e.g. all the "Smith"s or company names containing
             | "Inc." if that doesn't get removed), and b is the number of
             | blocks. This would be at least n, so O(kn) with a big k
             | makes some sense. If you have some way to get away with not
             | comparing every element in a block to every other element,
             | e.g. exact string match only I think you could get away
             | with O(s*log(s)*b), but most of the time the matches are
             | not going to be good enough.
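Plugging illustrative numbers into the s^2*b estimate shows why blocking pays off even with a badly skewed largest block (all figures below are made up for the arithmetic, not measurements):

```python
n = 1_000_000   # total records
b = 10_000      # number of blocks
s = 500         # size of the largest block (e.g. all the "Smith"s)

# Full pairwise comparison over all records.
naive_pairs = n * (n - 1) // 2

# Pessimistic upper bound: pretend every block is as big as the largest.
blocked_upper = s * s * b

# naive_pairs is ~5e11; blocked_upper is 2.5e9 -- roughly a 200x
# reduction, and real block-size distributions do far better.
```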
        
         | jonpon wrote:
         | You are 100% correct, this is a toy example that I decided to
         | put together for fun after talking to a bunch of people who
         | mentioned it as a problem that they experience.
         | 
         | The main idea I wanted to add to the discussion (which is not
         | that crazy of an addition) is that you can possibly use
         | sentence embeddings instead of fuzzy matching on the actual
         | letters to get more "domain expertise".
         | 
         | How to actually compare these embeddings with all the other
         | embeddings you have in a large dataset is a problem that is
         | completely out of the scope of this tutorial.
        
           | smeeth wrote:
           | Yeah, totally, I get you. I'm not trying to do a takedown, ER
           | is just hard.
           | 
           | The point I was trying to make is that at scale one does not
           | simply:
           | 
           | > compare these embeddings with all the other embeddings you
           | have
           | 
           | You just can't; similarity metrics (especially cosine) on
           | 768-dim arrays are prohibitively slow.
           | 
           | Using embeddings is quite common in the literature and in
           | deployment (I have, in fact, deployed ER that uses
           | embeddings), but only as part of #2, the matching step. In
           | many projects, doing full pairwise comparisons would take on
           | the order of years; you have to do something else to refine
           | the comparison sets first.
        
         | staticassertion wrote:
         | Cool, TIL about blocking/filtering. We do entity resolution
         | such that if we have a "process id" (ex: 1234) in two different
         | event logs we can determine if they're the same process (since
         | pids get reused). We don't have to compare every process to
         | every other process, only ones that share the same pid, which
         | drastically reduces the data size, and then we do filtering on
         | top of that based on other optional attributes.
         | 
         | When I wrote this system I didn't know what ER even was, an
         | article like this would have helped a lot, even just the first
         | line _defining_ ER would have helped a lot.
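The pid-sharing scheme described above is itself a form of blocking: group events by pid, then filter within each group on the other attributes. A hypothetical sketch (field names and the "same process" rule are invented for illustration, not taken from the commenter's system):

```python
from collections import defaultdict

# Hypothetical event records; pid alone is ambiguous because pids get
# reused, so extra attributes disambiguate.
events = [
    {"id": "e1", "pid": 1234, "start": 100,  "exe": "/usr/bin/python"},
    {"id": "e2", "pid": 1234, "start": 100,  "exe": "/usr/bin/python"},
    {"id": "e3", "pid": 1234, "start": 9000, "exe": "/usr/bin/sshd"},  # pid reused
    {"id": "e4", "pid": 4321, "start": 150,  "exe": "/bin/bash"},
]

# Blocking: only events sharing a pid can refer to the same process.
by_pid = defaultdict(list)
for e in events:
    by_pid[e["pid"]].append(e)

# Filtering within a block, using the other attributes.
def same_process(a, b):
    return a["start"] == b["start"] and a["exe"] == b["exe"]

matches = [
    (a["id"], b["id"])
    for group in by_pid.values()
    for i, a in enumerate(group)
    for b in group[i + 1:]
    if same_process(a, b)
]
# matches -> [("e1", "e2")]: e3 shares the pid but fails the filter,
# and e4 is never compared against anything.
```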
        
       | ropeladder wrote:
       | Entity resolution/record linkage/deduplication is an oddly
       | specialized domain of knowledge given that it's such a common
       | problem. I put together a page of resources a while back if
       | anyone is interested:
       | https://github.com/ropeladder/record-linkage-resources
        
         | chrisoakman wrote:
         | I gave a talk about Record Linking at Clojure/conj 2019:
         | https://www.youtube.com/watch?v=rGKEOMUtJfE
         | 
         | I opened a PR to add it to your list.
        
           | ropeladder wrote:
           | Merged! Thanks for the addition, looks like an interesting
           | talk with some good real-world lessons.
        
       | stevesimmons wrote:
       | A good starting point for Entity Resolution/Deduplication is the
       | Python Dedupe project [1, 2] and the PhD thesis on whose work it
       | is based [3].
       | 
       | [1] https://github.com/dedupeio/dedupe
       | 
       | [2] https://dedupe.io/
       | 
       | [3] http://www.cs.utexas.edu/~ml/papers/marlin-dissertation-06.p...
        
       | benmanns wrote:
       | What are people doing with entity resolution/record linkage? At
       | Doximity we use it to match messy physician, hospital, and
       | medical publication data from various sources into more coherent
       | data sets to power profiles and research tools. Mostly with
       | https://dedupe.io/ but with some custom tooling to handle larger
       | (1m+ entities) datasets.
        
         | nwatson wrote:
         | Replacing sensitive entity content with pseudorandom seeded
         | junk for subsequent training and transforms in exposed
         | settings. Not the main use case for ER.
        
       ___________________________________________________________________
       (page generated 2022-06-01 23:01 UTC)