[HN Gopher] Entity Resolution: The most common data science chal...
___________________________________________________________________

Entity Resolution: The most common data science challenge

Author : jonpon
Score  : 56 points
Date   : 2022-06-01 17:45 UTC (5 hours ago)

(HTM) web link (docs.magniv.io)
(TXT) w3m dump (docs.magniv.io)

| visarga wrote:
| I was expecting the use of sentence-transformers (sbert.net). If you have a long list of entities you could use an approximate similarity search library such as annoy. The authors store the embeddings in a database and decode JSON for each comparison. Very inefficient in my opinion. At least load the whole table of embeddings into a np.array from the start; np.dot is plenty fast if your list is not huge.
|
| The problem is still not solved. Having a list of the most similar entities does not tell you which are similar and which are just related. You need a classifier. For that you can label a few hundred pairs of positive and negative examples and use the same sbert.net to finetune a transformer. The authors use the easier route of thresholding the cosine similarity score at 0.8, but this threshold might not work for your case.

| fastaguy88 wrote:
| Sorry to post on a topic I know nothing about.
|
| To me, this looks very similar to local sequence similarity search (e.g. BLAST), where there are very rapid methods that use tuple-lookup and banded alignment to quickly identify "homologs" (the same entity). The nice thing about similarity searching algorithms is that they give you a very accurate probability of whether two strings are "homologous" (belong to the same entity). Perhaps I have the scale wrong, but it is routine to look for thousands of queries (new entities) among hundreds of millions of sequences (known entities) in an hour or so (and sequences are typically an order of magnitude longer than entity names).
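[Editor's note] visarga's suggestion earlier in the thread (load the whole table of embeddings into one np.array and score with a single matrix product instead of decoding JSON per comparison) might look like this minimal sketch. The entity names, the 4-dim vectors, and the 0.8 threshold are illustrative placeholders, not from the tutorial:

```python
import numpy as np

def top_matches(query_vec, embeddings, names, threshold=0.8):
    """Score one query embedding against the whole table at once.

    embeddings: (n, d) array of L2-normalized entity embeddings.
    One matrix-vector product replaces n separate comparisons.
    """
    query_vec = query_vec / np.linalg.norm(query_vec)
    # Rows are normalized, so the dot product IS the cosine similarity.
    scores = embeddings @ query_vec
    order = np.argsort(scores)[::-1]  # best match first
    return [(names[i], float(scores[i])) for i in order if scores[i] >= threshold]

# Toy data: three entities in a 4-dim embedding space (illustrative only).
names = ["Acme Inc", "ACME Incorporated", "Bolt LLC"]
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.9, 0.1, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

print(top_matches(np.array([1.0, 0.05, 0.0, 0.0]), emb, names))
```

As visarga notes below the similarity list, thresholding alone is crude; a trained pair classifier would replace the `threshold` test.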
| The problem is embarrassingly parallel, and very efficient algorithms are available for entities that are usually very similar.

| smeeth wrote:
| Danger awaits all ye who enter this tutorial and have large datasets.
|
| The tutorial is fun marketing material and all, but it's FAR too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example.
|
| ER wants to be an O(n^2) problem and you have to fight very hard for it not to turn into one. Most people doing this at scale are following basically the same playbook:
|
| 1) Use a very complicated, speed-optimized non-ML algorithm to find groups of entities that are highly likely to be the same, usually based on string similarity, hashing, or extremely complicated heuristics. This process is called blocking or filtering.
|
| 2) Use fancy ML to determine matches within these blocks.
|
| If you try to combine 1 & 2, skip 1, or try to do 1 with ML on large datasets you are guaranteed to have a bad time. The difference between mediocre and amazing ER is how well you are doing blocking/filtering.

| jandrewrogers wrote:
| So much this. I work on spatiotemporal entity resolution and it requires extreme care and cleverness to not turn it into a proverbial "boiling the ocean" compute problem for all but the most trivial cases. The implied state that needs to be managed alone is often effectively intractable. At one time supercomputers were used for large-scale entity resolution problems; they may still be.

| pfisherman wrote:
| Great advice! What is your rough guesstimate - in big O - of a general least upper bound of what you can do with ER?

| smeeth wrote:
| Caveat: just my personal hunch.
|
| My suspicion is O(kn), where k is both constant and rather large. This would be highly reliant on an extremely good blocking algorithm. Maybe trending towards O(n log n)? I don't know. It's really tough.
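[Editor's note] smeeth's two-stage playbook (cheap non-ML blocking first, expensive matching only inside blocks) can be sketched roughly as below. The blocking key (lowercased first token) and `difflib.SequenceMatcher` are simple stand-ins for the "very complicated" heuristics and the "fancy ML" matcher; the records and the 0.8 threshold are made up for illustration:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def block(records):
    """Step 1: cheap, non-ML blocking. The key here is just the
    lowercased first token; real systems use hashing/heuristics."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec.split()[0].lower()].append(rec)
    return blocks

def match(records, threshold=0.8):
    """Step 2: pairwise comparison, but only inside one block.
    A string-similarity ratio stands in for the ML match model."""
    pairs = []
    for a, b in combinations(records, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((a, b))
    return pairs

records = ["Acme Inc", "ACME Inc.", "Acme Industries", "Bolt LLC", "Bolt L.L.C."]
matches = [p for recs in block(records).values() for p in match(recs)]
print(matches)
```

The point of the structure is the cost model: pairwise work happens per block, never across the whole dataset, which is what keeps the O(n^2) blowup at bay.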
| The blocking algorithm for this hypothetical solution would most likely be an approximate hashing algorithm sort of resembling a bloom filter. The input data would have to be amenable to hashing, like a single string field. Add multiple fields and shit just got complicated fast. Look up q-gram blocking if you want to read a sort of simple example of a blocking algorithm like this.

| benmanns wrote:
| Thinking about this, I think you would end up with something like O(s^2*b), where s is the size of your biggest block (e.g. all the "Smith"s, or company names containing "Inc." if that doesn't get removed), and b is the number of blocks. This would be at least n, so O(kn) with a big k makes some sense. If you have some way to get away with not comparing every element in a block to every other element, e.g. exact string match only, I think you could get it down to O(s*log(s)*b), but most of the time the matches are not going to be good enough.

| jonpon wrote:
| You are 100% correct; this is a toy example that I decided to put together for fun after talking to a bunch of people who mentioned it as a problem that they experience.
|
| The main idea I wanted to add to the discussion (which is not that crazy of an addition) is that you can possibly use sentence embeddings instead of fuzzy matching on the actual letters to get more "domain expertise".
|
| How to actually compare these embeddings with all the other embeddings you have in a large dataset is a problem that is completely out of the scope of this tutorial.

| smeeth wrote:
| Yeah, totally, I get you. I'm not trying to do a takedown; ER is just hard.
|
| The point I was trying to make is that at scale one does not simply:
|
| > compare these embeddings with all the other embeddings you have
|
| You just can't; similarity metrics (especially cosine) on 768-dim arrays are prohibitively slow.
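[Editor's note] A hypothetical sketch of the q-gram blocking idea smeeth points to: every record is indexed under each of its character q-grams, and candidate pairs are drawn only from records that share at least one block, rather than from all n*(n-1)/2 combinations. The padding scheme, q=3, and the toy names are illustrative choices:

```python
from collections import defaultdict

def qgrams(s, q=3):
    """Character q-grams of a padded, lowercased string."""
    s = f"##{s.lower()}##"
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_blocks(records, q=3):
    """Invert records by q-gram: each q-gram keys one candidate block."""
    blocks = defaultdict(set)
    for i, rec in enumerate(records):
        for g in qgrams(rec, q):
            blocks[g].add(i)
    return blocks

def candidate_pairs(records, q=3):
    """Union of within-block pairs; dissimilar strings share no
    q-gram, so they are never even compared."""
    pairs = set()
    for ids in qgram_blocks(records, q).values():
        ids = sorted(ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((records[ids[i]], records[ids[j]]))
    return pairs

records = ["Jon Smith", "John Smith", "Mary Doe"]
print(candidate_pairs(records))
```

This also shows benmanns's worry concretely: a very common q-gram (all the "Smith"s) produces one huge block, and the within-block comparison cost is quadratic in that block's size s.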
| Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years; you have to do something else to refine the comparison sets first.

| staticassertion wrote:
| Cool, TIL about blocking/filtering. We do entity resolution such that if we have a "process id" (ex: 1234) in two different event logs we can determine if they're the same process (since pids get reused). We don't have to compare every process to every other process, only ones that share the same pid, which drastically reduces the data size, and then we do filtering on top of that based on other optional attributes.
|
| When I wrote this system I didn't know what ER even was; an article like this would have helped a lot, even just the first line _defining_ ER.

| ropeladder wrote:
| Entity resolution/record linkage/deduplication is an oddly specialized domain of knowledge given that it's such a common problem. I put together a page of resources a while back if anyone is interested: https://github.com/ropeladder/record-linkage-resources

| chrisoakman wrote:
| I gave a talk about Record Linking at Clojure/conj 2019: https://www.youtube.com/watch?v=rGKEOMUtJfE
|
| I opened a PR to add it to your list.

| ropeladder wrote:
| Merged! Thanks for the addition, looks like an interesting talk with some good real-world lessons.

| stevesimmons wrote:
| A good starting point for Entity Resolution/Deduplication is the Python Dedupe project [1, 2] and the PhD thesis on whose work it is based [3].
|
| [1] https://github.com/dedupeio/dedupe
|
| [2] https://dedupe.io/
|
| [3] http://www.cs.utexas.edu/~ml/papers/marlin-dissertation-06.p...

| benmanns wrote:
| What are people doing with entity resolution/record linkage?
| At Doximity we use it to match messy physician, hospital, and medical publication data from various sources into more coherent data sets to power profiles and research tools. Mostly with https://dedupe.io/ but with some custom tooling to handle larger (1m+ entities) datasets.

| nwatson wrote:
| Replacing sensitive entity content with pseudorandom seeded junk for subsequent training and transforms in exposed settings. Not the main use case for ER.

___________________________________________________________________
(page generated 2022-06-01 23:01 UTC)