[HN Gopher] Ask HN: What is new in algorithms and data structure... ___________________________________________________________________ Ask HN: What is new in algorithms and data structures these days? Algs/DS were my first love in CS. Nowadays, all we hear about is AI/ML. There must be hardware/software improvements coming from or necessitating fundamental Algs/DS research. Care to share any of your favorite recent results? Are there big fields still making gains behind all this computing surge? Author : jvanderbot Score : 378 points Date : 2023-05-10 13:16 UTC (9 hours ago) | zelphirkalt wrote: | There is ongoing research on purely functional data structures | and many implementations to be written for many functional | languages, to make use of PFDS. Whenever one sees a traditional | data structure, one can ask oneself what a persistent functional | one might look like, or whether a different solution would be | pursued in FP languages. Traditional data structures are often a | question of transcribing from an example language into a target | language, with various language-specific changes for | optimization. Finding a good way to implement things without | mutation or side-effects is a challenge. | revskill wrote: | Searching. | chc4 wrote: | In compilers, there's been a recent uptick in industry research | and adoption of equivalency classes and graphs (egraphs) | for doing optimization passes in a way that preserves information | and solves the phase ordering problem. Egg[0] was one of the big | libraries that came out of it, but can only handle term rewriting | without guards, and so can't express y/x*x -> y because it's | unsound when x=0, or optimization passes that work across control | flow points and thus some of the eclasses are only valid in some | locations. The Cranelift developers adapted it into a | construction they call aegraphs[1] that handles this, and are | migrating Cranelift to use these passes for all optimizations, | which is (hopefully) faster and achieves more optimization | opportunities; Chris Fallin is presenting about their aegraph work | at PLDI this year. | | 0: https://egraphs-good.github.io/ | | 1: | https://github.com/cfallin/rfcs/blob/main/accepted/cranelift... | tester756 wrote: | thanks, this is cool! | aseipp wrote: | The recent work on relational, datalog-inspired egraphs in PLDI | this year ("Unifying Datalog and Equality Saturation") is | actually interesting because it can solve cases like the y/x*x | -> y identity example, by the power of an interval analysis on | x (among other things). Sort of like adding a postulate to a | rewrite term and then proving it. But instead it's by adding | relations between terms in the graph. See section 6.2 of the | paper. | | https://github.com/egraphs-good/egglog | | https://arxiv.org/pdf/2304.04332.pdf | chc4 wrote: | The other recent research that's been interesting to watch has | been in logic programming (Prolog and Datalog), where there have | been a lot of papers extending Datalog with semilattice types | that are more efficient for computing dataflow-type queries | (and even getting lattice types added to Souffle recently), | including papers like datafun[0] extending it to sound | incremental queries for more efficient execution - and there's | even a paper by the Egg authors using it for efficient eclass- | based Datalog queries using lattices[1].
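| (To make the Datalog side concrete: the semi-naive fixpoint
| idea these engines build on, as a minimal Python sketch of
| transitive closure. A toy illustration only, not egglog or
| Souffle:)
|
|     # Rules: path(x,y) :- edge(x,y).
|     #        path(x,z) :- path(x,y), edge(y,z).
|     # Semi-naive evaluation joins only the facts derived in the
|     # previous round ("delta") instead of re-scanning everything.
|     def transitive_closure(edges):
|         path = set(edges)
|         delta = set(edges)
|         while delta:
|             derived = {(x, z) for (x, y) in delta
|                               for (y2, z) in edges if y == y2}
|             delta = derived - path   # keep only genuinely new facts
|             path |= delta
|         return path
|
|     print(transitive_closure({(1, 2), (2, 3), (3, 4)}))
|     # adds (1, 3), (2, 4), (1, 4) to the three input edges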
It also seems like | there has been more work recently in actually integrating logic | programming into existing languages and programs, like ascent[2], | instead of it being off in its own world, as seems to have | historically been the case. | | 0: http://www.rntz.net/files/seminaive-datafun.pdf | | 1: https://www.mwillsey.com/papers/egglog | | 2: https://s-arash.github.io/ascent/cc22main-p95-seamless- | deduc... | ggleason wrote: | Yeah, it's really unfortunate that these amazing advances in | Datalog don't make it into general-purpose languages. | jorkadeen wrote: | You might be interested in Flix which has first-class | Datalog program values: | | https://flix.dev/ | | https://doc.flix.dev/fixpoints.html | | (I am one of the developers of Flix) | thaliaarchi wrote: | How does Datalog with e-graphs compare to Datalog with BDDs | in terms of what analyses can be performed? bddbddb is a | Datalog engine using BDDs that, according to its 2005 | paper[0], can solve problems like context-sensitive pointer | analysis for large programs; analyses implemented with it | are faster than hand-tuned implementations, in dramatically | fewer lines of code. I'm not knowledgeable on what modern | Datalog engines are based on. | | 0: https://people.csail.mit.edu/mcarbin/papers/aplas05.pdf | chc4 wrote: | I'm under the impression that BDDs are good for more | efficient representation of row membership for values, so | that you can query and saturate more facts faster, but | eclasses allow for writing rules that "automatically" have | transitive connectivity of two rules. If you have X -> Y | and Y -> Z, you only need to store one of X :- Z or Y :- Z, | since X and Y have been unioned together and may be treated | identically, along with a same(X, Z) rule firing instead of | having to have additional transitive same(X, same(Y, Z)) | rules. I don't profess to be an expert in BDDs, eqsat, or | Datalog, though! I have one friend who did a thesis on | eqsat who wasn't that big of a fan of bddbddb but I don't | remember exactly why. | mad wrote: | The Cuckoo Trie, an ordered index data structure explicitly | designed to have memory-level parallelism, which out-of-order | processors can exploit to execute DRAM accesses in parallel: | | https://arxiv.org/pdf/2201.09331.pdf | barbarr wrote: | I'm not in the field but it seems that SODA (ACM/SIAM Symposium | on Discrete Algorithms) is a top conference in DSA and might be a | good place to check. Here's the abstract book for 2022: [1]. | | [1] | https://www.siam.org/Portals/0/Conferences/About_SIAM_Confer... | vitorsr wrote: | - Streaming algorithms [1] | | - Eventual consistency [2] | | - Refinement types [3], [4] | | [1] https://www2.cs.uh.edu/~ceick/6340/Streams.pdf | | [2] https://www.microsoft.com/en-us/research/wp- | content/uploads/... | | [3] https://ranjitjhala.github.io/static/trust_but_verify.pdf | | [4] | https://ranjitjhala.github.io/static/refinement_types_for_ty... | NeuroCoder wrote: | Refinement types are pretty interesting and I've seen mention | of them recently. I'm curious if there are any practical | reasons we don't see them implemented in more languages. | klibertp wrote: | > Refinement types are pretty interesting | | Indeed. To me, they look like a formalized subset of | precondition contracts that can be checked statically. I | don't think the idea is new; I seem to recall Ada and Eiffel | having something like this, but I first learned about them | via Racket[1].
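| (The runtime half of the idea is easy to sketch in Python: a
| predicate checked at a type's construction boundary. The Nat
| and spend names here are invented for illustration, not from
| any library; what refinement type systems add is discharging
| these checks at compile time, typically with an SMT solver,
| so the failing call below would be rejected before the
| program ever runs:)
|
|     # A "refinement" of int: values checked against a predicate
|     # at construction time. Purely a runtime contract here.
|     class Nat(int):
|         def __new__(cls, value):
|             if value < 0:
|                 raise ValueError(f"{value} fails refinement: value >= 0")
|             return super().__new__(cls, value)
|
|     def spend(balance: Nat, amount: Nat) -> Nat:
|         return Nat(balance - amount)  # re-checked on every construction
|
|     spend(Nat(10), Nat(3))   # fine: Nat(7)
|     spend(Nat(3), Nat(10))   # raises ValueError at runtime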
I still don't know how they work or what's | needed to implement them. | | Interestingly, Raku has a similar concept called `subset`, | eg. subset OneToTen of Int where * > 0 & * <= 10; | | but from what I know no static analysis is done on the | refinement/"where" part, it's a runtime assertion only. | | [1] https://blog.racket-lang.org/2017/11/adding-refinement- | types... | sireat wrote: | For Maximum Flow (and Minimum Cut) through a flow network, we are | getting close to linear time: | | 2022 Paper https://arxiv.org/abs/2203.00671 | cjalmeida wrote: | It's not particularly new, but recently I've been involved a lot | with solving large combinatorial optimization problems | | https://en.m.wikipedia.org/wiki/Combinatorial_optimization | | Algos to solve those, both exact methods and heuristics, are super | useful in real-world applications, and very underappreciated by | the HN crowd IMO. | JZL003 wrote: | Yes! Me too, there's a whole world out there and it's codified | in bullet-proof programs like CPLEX or Gurobi. Sadly they are | expensive and it's not obvious if a given problem's size and | complexity will lend itself to some of their heuristics. | | Have you been doing it by hand or using a big program? | cjalmeida wrote: | At $DayJob, we deal with OR problems for large companies and | typically, when possible, we leverage Gurobi to solve Mixed | Integer Linear Programming (MIP) formulations. However it's | not unusual for the problems to be huge to the point that even | Gurobi can't solve them in a reasonable time. And of course, | there's the case where the problem can't be formulated (or | approximated) as a linear program at all. | | So, in a couple of projects I was involved in, we had to resort to | manually constructed heuristics such as Simulated Annealing | (SA) and Large Neighborhood Search (LNS) (a toy SA sketch | appears below). | | A very cool large-scale approach we resorted to in some problems | (eg. airline scheduling) was Column Generation[1]. Not only | did it solve in minutes what took hours in the MIP | implementation; unlike SA and LNS, it's an _exact_ method | that provides valid bounds on global optimality. | | [1]: https://en.wikipedia.org/wiki/Column_generation | kherud wrote: | Answer Set Programming is an incredibly powerful tool to | declaratively solve combinatorial problems. Clingo is one of | the best open source implementations in my opinion: | https://github.com/potassco/clingo | | - Based on conflict-driven nogood learning, it has incredible | performance (including good multithreading) | - It has different optimization and enumeration modes | - It has strong built-in heuristics, but also allows manual | heuristics | - It allows incremental solving | - It has a nice API to integrate other solving paradigms (like | difference logic and constraint programming, for which | implementations already exist) | - There is a huge ecosystem around it | | And much more, go check it out! | bjornsing wrote: | Any advice on how to get to work on this for a living? | mtsolitary wrote: | I think it depends a lot on the domain you are in. An example | of some domains with lots of discrete opt problems: | transport, scheduling, logistics. And then read stuff by | Stephen Boyd :) | mtsolitary wrote: | What makes you think they're underappreciated?
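| (Since SA came up a few comments back: the core loop is tiny.
| A minimal Python sketch on a toy objective -- the function
| below is generic and invented for illustration; real solvers
| add restarts, tuned cooling schedules, and domain-specific
| moves:)
|
|     import math, random
|
|     # Simulated annealing over permutations: propose a swap, always
|     # accept improvements, accept worse moves with prob exp(-delta/T).
|     def anneal(cost, n, steps=50_000, temp=1.0, cooling=0.9995):
|         perm = list(range(n))
|         random.shuffle(perm)
|         cur = cost(perm)
|         best, best_cost = perm[:], cur
|         for _ in range(steps):
|             i, j = random.sample(range(n), 2)
|             perm[i], perm[j] = perm[j], perm[i]
|             new = cost(perm)
|             if new <= cur or random.random() < math.exp((cur - new) / temp):
|                 cur = new
|                 if cur < best_cost:
|                     best, best_cost = perm[:], cur
|             else:
|                 perm[i], perm[j] = perm[j], perm[i]  # undo the swap
|             temp *= cooling
|         return best, best_cost
|
|     # Toy objective: item k wants position k; cost is total displacement.
|     _, c = anneal(lambda p: sum(abs(p[k] - k) for k in range(len(p))), 20)
|     print(c)  # usually 0 or close to it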
I too have been | spending a lot of time in this space recently and agree it's | one of the more fertile "maths meets reality" spaces :) | Karrot_Kream wrote: | > Algos to solve those both exact methods and heuristics are | super useful in real world applications, and very | underappreciated by the HN crowd IMO. | | There just aren't many situations where they're useful. Time | scheduling, logistics, operations: there are only a handful of | places where these problems are so important that they need to | be solved optimally or nearly optimally. | cjalmeida wrote: | IMO, the main benefit of those exact algos is that they | provide optimality bounds, allowing you to make an informed | decision to continue, or to stop and provide a non-optimal | solution. This is valuable in many practical situations. | | >Time scheduling, logistics, operations... | | Add marketing (pricing, assortment, ...) and finance | (portfolio, capital budgeting, ...) to those. | | And those are _huge_ domains, moving trillions of USD | globally each year. Many companies (think most of the Global | 2000) benefit from a couple % improvement on their operations | that leads to many $$ in savings. | adeon wrote: | Are there some interesting examples of real-world applications | of this stuff? Is it like planning problems or supply chain | optimization etc.? | | I've used SAT solvers recreationally to try to solve some | mathematical combinatorics-like problems, but interesting, | more down-to-earth applications have eluded me. I would love to | dig deeper into this stuff. | currymj wrote: | Gurobi (maker of a MIP solver) has a bunch of industry case | studies as marketing materials on their website: | | https://www.gurobi.com/case_studies/ | | It's mixed-integer programming rather than SAT | solvers/constraint programming, but similar in spirit. | jvanderbot wrote: | I used a MILP solver to optimize my ship loadouts in | Highfleet. It's rugged-looking, but works great. | https://hfopt.jodavaho.io | lqr wrote: | The field of "algorithms with predictions" studies how to use | predictions/learning within traditional CS problem settings: | | > _We introduce algorithms that use predictions from machine | learning applied to the input to circumvent worst-case analysis. | We aim for algorithms that have near optimal performance when | these predictions are good, but recover the prediction-less worst | case behavior when the predictions have large errors._ | | An example is using learning to decide which variable to | discretize next in a branch-and-bound integer linear programming | solver. The algorithm might be extremely similar to the classic | version, but the analysis is new. | | https://arxiv.org/abs/2006.09123 | | More broadly, I think there's a lot of work on applying more | refined analysis to classic algorithms - online settings, | smoothed analysis, etc. | sarojmoh1 wrote: | First Love?! Haha, jk. | | I'm currently re-studying them just bc I'm interview prepping. I | wonder if whiteboarding for DS/Algo is gonna be a thing of the | past too though. | | I would imagine there's certainly new DS/Algos in progress to | aid ML training efficiency | yieldcrv wrote: | Merkle trees and state tries were not in my formal education | | But I had to learn them and employ them a lot lately, very good | for resource-constrained environments | di4na wrote: | Parsing but also casting to string efficiently is still an open | problem with regular breakthroughs. Ryu or Dragonbox for float to | string come to mind.
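| (What Ryu and Dragonbox actually solve: print the shortest
| decimal string that round-trips to the exact same float, and
| do it fast. Python's repr has guaranteed shortest round-trip
| output since 3.1, so the contract is easy to see:)
|
|     x = 0.1 + 0.2
|     print(repr(x))      # '0.30000000000000004' -- the shortest
|                         # string that parses back to exactly x
|     print('%.15g' % x)  # '0.3' -- shorter, but parses to a
|                         # *different* float
|     assert float(repr(x)) == x        # round-trip guarantee
|     assert float('%.15g' % x) != x    # lossy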
| | Parsing with errors that are useful for the human is definitely | an open research topic. | | Capabilities have seen interesting work, in particular in | relation to effect handlers. | | HyperLogLog and other data sketches keep seeing incremental | progress. I have not had time to read it all but Otmar Ertl has a | massive reduction in memory use in UltraLogLog that I expect a | paper on soon. The code is already up. | hinkley wrote: | I've been poking at Prometheus internals lately and I am | starting to wonder if histograms need some sort of hyperloglog | treatment to improve on buckets. They don't seem like a very | accurate solution. | di4na wrote: | You may appreciate ddsketch | https://www.datadoghq.com/blog/engineering/computing- | accurat... | | Or tdigest. For the prom use case, I usually prefer ddsketch. | Easier to integrate. Tdigest is really complicated. | | That said, there is definitely interesting work that needs to | happen on compressing time series histograms for heatmaps. | IIRC Influx had some interesting work there. | epberry wrote: | Unfortunately for you my favorite recent algorithms result uses | AI as part of improving a very traditional algorithm. The paper | is "The Case for Learned Index Structures" | https://blog.acolyer.org/2018/01/08/the-case-for-learned-ind... | morkpek wrote: | This is a wonderful paper. In a similar vein, "Algorithms with | Predictions" (https://arxiv.org/pdf/2006.09123.pdf) presents a | bunch of fun algorithmic problems related to bringing AI into | traditional algorithms. | inciampati wrote: | Yes. There is ongoing work in making succinct full text indexes | that use substantially less space than the text they model. See | work on the r-index: https://doi.org/10.1089/cmb.2019.0309 | shae wrote: | I'd suggest looking up succinct data structures, cache-oblivious | data structures, and probabilistic data structures. | mamcx wrote: | Anything specific that could cover what "NDArray-like" | structures are? Concretely, to make row-oriented/column- | oriented hybrids? | | (I wish to find a way to encode the equivalent of `struct | Data { col1: Vec<..>, ... }`.) | LukeEF wrote: | Succinct data structures ftw. | LukeEF wrote: | How about some succinct data structures and delta encoding for | modern databases [1]. Succinct data structures are a family of | data structures which are close in size to the information | theoretic minimum representation (while still being queryable). | | [1] | https://github.com/terminusdb/terminusdb/blob/dev/docs/white... | TallGuyShort wrote: | Algs/DS are still pretty fundamental to AI. Hyperparameter | optimization algorithms like SHA/ASHA, stochastic gradient | descent, PBT, etc. Structures like transformers, the various | topologies, convolutions, etc. Not tickling the same itches? | abeppu wrote: | For numerical algorithms, probabilistic numerics is an | interesting research corner. It both provides a new perspective | on existing algorithms, and provides a framework for tailoring | new ones to specific problems. | | https://www.probabilistic-numerics.org/ | softwaredoug wrote: | Nobody has mentioned Approximate Nearest Neighbor search (aka | vector search), which, IMO, is a fundamental data structures | advancement. | | Basically: given a set of a million (billion, trillion...) | ~1000-dimensional vectors, and a ~1000-dimensional query vector, | find the closest vector in the indexed set.
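| (The exact version is a few lines with NumPy -- the whole
| field exists because this linear scan is O(n*d) per query and
| stops being viable at the scales above, so ANN indexes like
| HNSW or IVF trade a little recall for orders of magnitude
| less work:)
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     index = rng.normal(size=(100_000, 128))  # the indexed vectors
|     query = rng.normal(size=128)
|
|     # Exact nearest neighbor: score every vector, take the argmin.
|     dists = np.linalg.norm(index - query, axis=1)
|     print(np.argmin(dists), dists.min())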
This is "nearest neighbor" search and | there have been increasingly more and more sophisticated | approaches: | | http://ann-benchmarks.com | | https://haystackconf.com/us2022/talk-6/ | | And has spawned companies like Weaviate, Pinecone, etc etc (a | half a dozen others) | marginalia_nu wrote: | What sort of dataset are you indexing that has trillion | entries? | | That's 100,000 english wikipedias or 50 googles. | maerF0x0 wrote: | There's about 14B trades per year on the NYSE which i'm sure | could represent 10x that in entities (buyer, seller, broker, | etc) and could easily hit 1000x that in log lines. The shares | per day is in the billions, so hitting 1T if each share is | represented uniquely. | marginalia_nu wrote: | You don't typically use vector search for trade data | though. It's already ridculously well structured. Assets | have identifiers, parties and counterparties have IDs, etc. | I'm not sure what nearest neighbors in a vector space would | add. | l33t233372 wrote: | Nonetheless it's an example you asked for of a dataset | with over a trillion entries. | marginalia_nu wrote: | I asked which dataset they were indexing that was of this | size, not whether any such dataset exists in other | domains. | yes_man wrote: | Dumb example but still example from practical world. Your | body (assuming you are a human) has trillions of cells. Each | cell way more complicated than what a 1000-dimensional vector | can represent, but maybe it could be a loose estimate of some | properties in each cell. Now the algorithm could be about | finding the most similar cell. Could be useful for finding | e.g. other cancer cells based on properties in one cancer | cell. | | Not that this is a practical example because we technically | cannot index all cells in each body. But even if such an | algorithm being studied today might be useful one day when we | do capability to collect such data | marginalia_nu wrote: | Where are you going to store this data? It's dozens of | petabytes. | tourist2d wrote: | Where do you think computers store data? | yes_man wrote: | Should theoretical research of data structures and | algorithms have been capped at 1GB in 1980 because that | was the biggest single hard drive available back then and | you couldn't store for example a 2GB dataset on a disk? | marginalia_nu wrote: | Not at all, I'll still call out fantastic claims when I | see them though. | karpierz wrote: | Google has definitely indexed over a trillion pages. | marginalia_nu wrote: | Do you have any sources for this claim? | | As far as I am aware Google doesn't publish any | statistics about the size of its index, which no doubt | varies. | sophacles wrote: | It's only a few racks worth of disk servers. | | If I was building it from my 5 minutes of googling, using | 15TB nvme u2 drives, and easily available server chasis, | I can get 24 drives per 2u of a rack. That's 360 TB + a | couple server nodes. So ~6u per PB. A full height rack is | 42u, so 6-7PB per rack once you take up some of the space | with networking, etc. So dozens is doable in a short | datacenter row. | | Realistically you could fit a lot more storage per U, | depending on how much compute you need per unit of data. | The example above assumes all the disks are at the front | of the server only, if you mount them internally also, | you can fit a lot more. (see Backblaze's storage pods for | how they did it with spinning disks). | | Dozens of PB is not that much data in 2023. | de_nied wrote: | One application are LLMs. 
I've seen a project use Pinecone to | enable "infinite context" for GPT-3. | marginalia_nu wrote: | This is still several orders of magnitude more items than | the entire training corpus for all GPT models combined. I | guess if you were to index individual codepoints in the | training corpus, we'd start to see those volumes. | fzliu wrote: | Most commonly used ANN algorithms today are clever | optimizations atop "old" data structures. DiskANN is a recent | favorite: https://milvus.io/blog/2021-09-24-diskann.md | kokanee wrote: | "Most commonly used X today are clever optimizations atop old | Y" is pretty much the story of technology, isn't it? | RoyGBivCap wrote: | It's lego all the way down to the clock cycle. | curiousgibbon wrote: | Negative-weight single-source shortest paths in near-linear time: | https://arxiv.org/abs/2203.03456 | | Obligatory Quanta link: https://www.quantamagazine.org/finally-a- | fast-algorithm-for-... | ww520 wrote: | Not sure it's still considered new as it has been out for a couple | of years, but adaptive radix trees and their variants are pretty | promising. | samwillis wrote: | Anything CRDT (conflict-free replicated data types: | https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...) | related is fun to read up on and play with. | | Papers and references (page maintained by a central academic in the | world of CRDTs): https://crdt.tech | | Group doing research into how they can be used to build | interesting collaborative (and async) applications: | https://www.inkandswitch.com | | A few of the major open source implementations - mostly for rich | text editing or JSON-like data structures: | | - Yjs: https://github.com/yjs/yjs | | - Automerge: https://github.com/automerge/automerge | | - Peritext: https://www.inkandswitch.com/peritext/ | | - Diamond types: https://github.com/josephg/diamond-types | | People building eventually consistent database syncing with them: | | - https://electric-sql.com (Postgres <-> SQLite) | | - https://vlcn.io (SQLite <-> SQLite) | | Open source collaborative servers (coordination, persistence, | presence): | | - https://github.com/ueberdosis/hocuspocus | | - https://github.com/partykit/partykit | | - https://github.com/firesync-org/firesync | beck5 wrote: | Thanks for mentioning FireSync, we haven't even officially | launched yet! After 10+ years of building real-time | collaborative apps based on Operational Transformation | (ShareLaTeX -> Overleaf.com) we have become quietly very | excited about CRDTs and Yjs. Now we are focused on building a | scalable Yjs-compliant backend on top of PG with FireSync. | | If anyone has thoughts about this space, feature requests, | would like a preview of what we are building, or anything else, | please do reach out directly to me at henry@firesync.live, I'm | talking to as many people as possible at the moment. | josephg wrote: | Diamond types author here. I'm working on another blog post - I | had a big breakthrough getting collaborative text editing many | times faster again using some new ideas. I'm in the process of | writing up our new work now. | | And there's another hard problem I'm stuck on: so, CRDTs like | mine and Yjs use (random agent id, sequence number) as unique | IDs for operations. But a troublesome user might reuse an | operation ID, sending different operations to different peers | and giving them the same (agent, seq) pair. This results in | really awful, weird bugs. Martin Kleppmann wrote a paper about | BFT in CRDTs.
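| (For concreteness, the failure mode -- a toy sketch, not
| diamond types' actual format: an equivocating agent hands two
| peers different payloads under the same (agent, seq) ID, and
| since each peer skips IDs it has already seen, they silently
| diverge forever:)
|
|     def apply(store, op):
|         # Dedup by ID: ignore an op whose (agent, seq) was seen.
|         store.setdefault((op["agent"], op["seq"]), op["payload"])
|
|     peer_a, peer_b = {}, {}
|     apply(peer_a, {"agent": "mallory", "seq": 1, "payload": "insert 'x'"})
|     apply(peer_b, {"agent": "mallory", "seq": 1, "payload": "delete all"})
|
|     # Later, the peers sync by exchanging ops. Both already "have"
|     # ("mallory", 1), so neither accepts the other's version:
|     apply(peer_a, {"agent": "mallory", "seq": 1, "payload": "delete all"})
|     apply(peer_b, {"agent": "mallory", "seq": 1, "payload": "insert 'x'"})
|     print(peer_a == peer_b)  # False -- permanent divergence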
His suggestion is we do what git does and use SHA | hashes for IDs instead. | | But there's two problems: | | 1. We don't want to store a hash for every keystroke a user | makes. That would get big, fast. We want a DAG of hashes like | git, except a hash is generated with each keystroke. How do we | build an efficient system that can query by hash, without | storing all of the hashes? | | 2. I want users to be able to permanently delete inserted text. | If I accidentally paste an AWS key into a document, I don't | want that key in the document history forever. How do I enable | that while still getting all the benefits of the hashing system | above? (Note if your hash includes the typed character, even if | you delete the character itself from disk, you can trivially | brute force the hash to recover any typed characters). | | Each of these problems in isolation has a solution. But the | real prize is how do you solve both at the same time? I think | this problem has a solution but I don't think anyone has solved | it yet. Want to name an algorithm and get my eternal respect? | Here's your chance. | 020921 wrote: | I know you said you didn't like the idea of hashes, so this | might be a non-starter for your use case. | | But it sounds like a V5 Guid [0] might meet some of the needs | for your "operation Ids". The existing "random agent id", and | "sequence number" could be used as part of the namespace / | salt. | | [0] https://www.sohamkamani.com/uuid-versions- | explained/#v5-non-... | josephg wrote: | I don't have a problem with hashes generally. Correct me if | I'm wrong, but this scheme sounds like hashing with extra | steps? | | I agree this could be done. But diamond types (my library) | considers every keystroke to be a new change. If we store a | UUID for every keystroke, that's very inefficient. | texuf wrote: | Troublesome user? As in a malicious user? Are you trying to | lock down your attack surface against script kiddies or is | there another instance where this situation comes up? | josephg wrote: | Yes, a malicious user. I've also seen buggy software try to | reuse IDs - which causes similar mayhem. It would be better | if we could remove this entire class of problems. | lcnPylGDnU4H9OF wrote: | It could mean that, as in "someone who causes trouble" or | it could be a result of a bug which only affects a handful | of users who suddenly "have some troubles". | KineticLensman wrote: | > or is there another instance where this situation comes | up? | | Yes, if you are a Byzantine general trying to coordinate an | attack, and you aren't sure of your peers' true allegiance. | some_furry wrote: | Problem 2 can be solved by "cryptographic erasure" | techniques. | | 1. Encrypt the data with an ephemeral key. | | 2. Upon delete, just wipe the key. The data becomes | unrecoverable. | | I'm not sure about problem 1. A fast seekable stream cipher | like ChaCha might help here (don't use e.g. RC4 for this), | but a faster hash like BLAKE3 might be more appropriate. I'm | happy to noodle over this some more. These are just my | unfiltered thoughts. | | If you combined the two: Each entry in the graph could have a | public salt used for key derivation for the actual data. This | salt would not be part of the DAG, but stored outside of it. | To delete an entry, zero-out the salt and the underlying key | becomes unrecoverable, which makes the data unreadable, | without needing to change the history of the DAG. | josephg wrote: | Yes. 
But diamond types stores a new entry in the DAG for | every keystroke each user makes. The current scheme of IDs | being (agent, sequence number) allows subsequent operations | by the same user to be trivially run-length encoded. Eg if | I make 1000 changes in order, we can encode that as | ("seph",1..1000). It takes much less than 1 byte of | overhead on disk per character typed. | | Salts per character + cryptographic erasure would solve | problem 2. But how do we do it without a massive increase | in storage, memory use and network bandwidth? I don't want | to generate and store a cryptographically unique salt per | keystroke. (Even though it might run fine on modern | hardware). | benlivengood wrote: | 1. Can you merge inserts of characters into inserts of | strings reliably, given both the original inserts and the | current document history? E.g. if there is a closed set of | history that only inserts into the middle of the subset of | inserts that you want to turn into a string, those inserts can | be permanently merged into a string insert. | | 2. String merging should also allow null merges, e.g. | identify the same closed set of history inserts that refer to | only the inside of the original inserts, and so permanently | deleting history of characters in the document could be | replaced with a null insert that can still be referred to by | other inserts/merges. | flir wrote: | Would (1) hit problems because you may store them coalesced | eventually, but you sent them to the other clients as | individual events, where they may get interleaved with | events from that client? | | I think it might mess with fine-grained undo, also? | | (idk much about CRDTs) | | "Reject re-used sequence numbers" might be simpler? | debatem1 wrote: | Sounds like inverted hash chains. Generate a hash chain h_i = | sha(h_(i-1)) for a random h_0 up to h_N. Each message m_i | becomes (agent_id, state, hmac(h_(N-i+1), state), h_(N-i)). | Upon receipt of message m_(i+1) the receiver can verify the | hmac of the preceding state and that the key is the preimage | of the earlier message. By repeated hashing, any gaps in the | sequence can be fast-forwarded through and still verified. | This provides for deletion without a loss of verifiability up | to length N. | joelg wrote: | it's definitely not an exact solution, but i've been thinking | about ways prolly trees could fit into CRDT systems. they | essentially give you efficient peer-to-peer entry-wise diffs, | so could be good for identifying conflicts after-the-fact. | but if you store operations in a prolly tree, you end up | hashing them all (plus logarithmic additional space | overhead), so it might be a non-starter. | | i have a blog post to shill if you haven't heard of them: | https://joelgustafson.com/posts/2023-05-04/merklizing-the- | ke... | vivegi wrote: | > 1. We don't want to store a hash for every keystroke a user | makes. | | So, why does it have to be every keystroke? We can batch them | locally. An approximate rule of thumb to use to guide the | batching is typing speed. Using 200 keystrokes/minute, we get | an average of 3.33 keystrokes/second. If we use 3 seconds of | latency for the collaborative editing to _resolve_ concurrent | edits, we get 10 keystrokes per batch (3 secs). You can also | have an override where smaller batches are replicated to | nodes if sufficient time has elapsed since the last local edit | (eg: 10 secs have elapsed and we have a partial batch of 3 | keystrokes only). | | > 2. 
If I accidentally paste an AWS key into a document, I | don't want that key in the document history forever. How do I | enable that while still getting all the benefits of the | hashing system above? | | In order for this to work, we should assume that every | replica's current state is always obtained by starting with | the empty document and then playing the operations log. We | should also permit the author of an operation to _rollback_ | the operation. So, I suppose if the collaborative editor | shows an interface of all operations that originated locally | and the author can select the operations that inserted the | AWS key into the replica, they can issue a _rollback | operations_ op and have it propagated to all replicas. | | Since the point-in-time state of the replicas is regenerated | by replaying the operations log, we would have deleted the | accidentally inserted key. | | In order to address the issue of the AWS key still being in | the operations log, we can define the semantics of the | _rollback op_ to also include secure erasure of the previous | operation from the operations log (i.e., the rollback op | would update the original operation into a no-op). | | So, when all replicas apply the _rollback op_ initiated by | the author, eventually all replicas will converge to having | the accidentally inserted AWS key removed from the replica | and the operations log. | amelius wrote: | I think "keystrokes" was just a metaphor for whatever | operation the application has to deal with. | vivegi wrote: | I was responding to the diamond-types CRDT author's | question in the parent comment. Their GitHub project page | [1] mentions text editing a lot: | | > This repository contains a high performance rust CRDT | for text editing. This is a special data type which | supports concurrent editing of lists or strings (text | documents) by multiple users in a P2P network without | needing a centralized server. | | > This version of diamond types only supports plain text | editing. Work is underway to add support for other JSON- | style data types. See the more_types branch for details. | | In any case, I agree with the metaphor, and batching | granular operations can always be done. | | [1]: https://github.com/josephg/diamond-types | gigatexal wrote: | I like your approach to thinking about this. Seems like a | plausible set of assumptions to make. | ramses0 wrote: | 1. Bloom for existence, re-create via brute-force search (?), | checkpoint hash every $N seconds(?) | | 2. Godbolt => "This rendered text came from these portions of | the DAG/tree, press CTRL-Del to irrecoverably remove them | from the DAG and/or replace them with 'REDACTED'". | | #1 is "simply" related to efficiency. You can lean on | probabilistic data structures to see "yeah, probably this was | HERE in the history tree", and then perhaps replay the "N" | seconds checkpoints with the original input data to validate | or match exactly where the hash is/was. | | #2 is the same problem git has, where you need to rewrite / | filter / "detached head" in order to obliterate something | from history, and is somewhat a local/remote UI issue. "A | request to obliterate history is being built in parallel over | here => click here to adopt it, and diff/transfer only your | new work to it" | | Related to #2 I've had a thought about a social network | protocol, and how to "delete malicious nudes" or "oops, I | accidentally published my SSN and Tax Filing *.pdf to my | friends".
On the "real" open internet, google or malicious | actors would/could store that info forever. However, in a | semi-closed group (eg: 100-1000 "friends of friends") that | are "all" running participating/conforming clients, a | "request-to-delete" or "request-to-disavow" is particularly | valuable. | | My imagined protocol is: "post001.txt" => "$GOOD_SHA", | "nudes.jpg" => "$WHOOPS_SHA", "taxes.pdf" => "$YIKES_SHA", | "bait.txt" => "$BAIT_SHA". | | "Of course" everything is content-addressable, ie: anyone can | fetch "post001.txt" from anyone as "$GOOD_SHA", however, "it | would be nice" if I as a participant in the network could | query "how_many( $GOOD_SHA );" and see distribution across | the network participants. "disavow( $WHOOPS_SHA, $YIKES_SHA ) | && how_many( $WHOOPS_SHA, $YIKES_SHA )" => ie: in a "good" | network, after a "disavow(...)" then nodes should not respond | with that availability. Occasionally throw in "$BAIT_SHA w/ | disavow(...)" and make sure that no one is propagating it. | | Similar to "key revocation lists", you could eliminate/reject | any clients which do not respect requests of non-propagation | for $BAIT_SHA or anything that is "disavow(...)'d". Same | story with anyone hosting | https://en.wikipedia.org/wiki/Celebrity_sex_tape ... I don't | really want them in my network, so query for particular | content, and if they respond, I can nudge/boot/disengage from | users that are hosting them (same with eg: "I love my AK-47 | and assault rifle" political memes, disengage / dislike / | spread apart people sharing/hosting/propagating content that | I don't want). | | While you're still left with malicious / non-conforming nodes | potentially archiving everything for all eternity, there is a | relatively immediate ability to fracture / exclude (shun) | nodes that are detected as non-conforming, so "your stuff" is | still out there, it's just relatively inaccessible, and you | have to be "careful" to have your client be up-to-date w.r.t. | revocation actions otherwise you'll be excluded from | networks. | | Meh. Kindof complicated, but those seem to be in the range of | characteristics that you're talking about as well. | samwillis wrote: | If my understanding of byzantine fault tolerance with CRDTs | and collaborative editing is correct, it's really only a | problem in systems where you have an untrusted user, someone | who is going to actively try and corrupt the system? | | It seems to me that for the majority of use cases, | collaborative editing in a work sense, you would normally | have trust that a user isn't actively hostile. And so I'm not | convinced that BFT needs to be a high priority goal in most | of these implementations. | | I think I higher priority issue is with how to aid developers | with the use generic CRDT structures within a system where | all the rules are not encoded into the CRDTs themselves. So | for example when using Yjs with Prosemirror, Prosemirror has | its own schema that defines how a document can be structured. | However not all those rules are encoded into the underlying | Yjs structure, it's possible to have two conflicting edits | that results in a Yjs document that doesn't comply with the | Prosemirror scheme. The current system throws away any none | conforming data when the document is loaded or updated. 
| Because of this you need to load | the document into a server-side instance of Prosemirror in | order to apply that schema when doing a server-side merge, or | even if you are just rendering the static content after | "blindly" merging the documents. | | My point really is that CRDTs are an encoding of user | _intent_, and when these generic CRDTs are merged you have a | structure representing merged user intent, not a final | conforming state. The solution is either domain-specific CRDTs | that are tailored exactly to your application, or a schema | system for these generic types that applies rules when merging. | Prosemirror/Yjs does (mostly) exactly this but I think it | needs to be solved in a more generic way that can be run on | any layer of the system. | josephg wrote: | Yes - The BFT problem only matters when you have Byzantine | actors. But I think users deserve and expect the system to | be reasonably well behaved and predictable in all | situations. Anything publicly writable, for example, | needs BFT resilience. Or any video game. | | As for the prosemirror problem, I assume you're talking | about weird merges from users putting markdown in a text | crdt? You're totally right - this is a problem. Text CRDTs | treat documents as a simple sequence of characters. And | that confuses a lot of structured formats. For example, if | two users concurrently bold the same word, the system | should see that users agree that it should be bolded. But | if that "bold" intent is translated into "insert double | asterisks here and here", you end up with 4 asterisks | before and after the text, and that will confuse markdown | parsers. The problem is that a text crdt doesn't understand | markdown. It only understands inserting items, and bolding | something shouldn't be treated as an insert. | | JSON editing has similar problems. I've heard of plenty of | people over the years putting json text into a text crdt, | only to find that when concurrent edits happen, the json | grows parse errors. Eg, if two users concurrently insert "a" | and "b" into an empty list, the result is ["a""b"], which | can't be parsed (reproduced in the sketch below). | | The answer to both of these problems is to use CRDTs which | understand the shape of your data structure. Eg, use a json | OT/crdt system for json data (like sharedb or automerge). | Likewise, if the user is editing rich text in prosemirror | then you want a rich text crdt like peritext. Rich text | CRDTs add the concept of annotations - so if two users bold | overlapping regions of text, the crdt understands that the | result should be that the entire region is bolded. And that | can be translated back to markdown if you want. | | Luckily, in over a decade of work in the collaborative | editing space, I haven't found any other examples of this | problem other than rich text and json. I think if a | collaborative editing system supports json data, rich text | and plain text then it'll be enough to add collaborative | editing support to 99% of applications. | | The ink & switch people did a great write-up of how rich | text CRDTs work here: | https://www.inkandswitch.com/peritext/ | jsunderland323 wrote: | Hi Joseph, | | This is a hard problem and something that I think has to be | intrinsic to the serialization method of your CRDT. Unique | IDs of any kind are really risky in CRDTs because they | basically break the mathematical links between potential | partitioned duplicates.
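| (josephg's ["a""b"] example above takes three lines to
| reproduce -- concurrent inserts merged as raw text splices, a
| toy sketch of why the structure-unaware merge breaks:)
|
|     import json
|
|     def splice(text, pos, s):
|         return text[:pos] + s + text[pos:]
|
|     doc = "[]"
|     # Two users concurrently insert at index 1 of the empty list;
|     # a plain text merge applies both character inserts side by side.
|     merged = splice(splice(doc, 1, '"a"'), 1, '"b"')
|     print(merged)  # ["b""a"]
|     try:
|         json.loads(merged)
|     except json.JSONDecodeError as e:
|         print("no longer parses:", e)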
Lists make this problem even harder | because you have to contend with whether you should incorporate | the ordinal of each element into the hash value. | | I'm working on a version control system for visual programs. | You wrote a response to someone on structured edits a few | months back about how the UNIX fs not being schema/structure | aware is the core problem with collaborative edits and | destroys open interoperability. I think you hit the nail on | the head. I've been working on something for nearly a year | now to try to address that problem. Let me know if you want | to compare notes. | jsunderland323 wrote: | To answer your question 2: I don't think you can | realistically delete a key, contend with network | partitions, and have a true eventually consistent data | structure. I think you're sort of running into the uncommit | problem. Feels solvable but at the expense of another trade- | off you don't want to make. The solution here is really in | workflow. Git solves it by separating the act of committing | and pushing. If you push a commit containing a secret, the | only thing you can do is invalidate the secret or not push | your offline branch containing the commit in the first | place. | hnfong wrote: | CS 168: The Modern Algorithmic Toolbox, Spring 2023 | | https://web.stanford.edu/class/cs168/index.html | | I didn't have time to go through all of this, but the parts that | I did read were very interesting to me. | | It's not even cutting-edge new stuff of 202x; probably most of | them are new-ish results from ~2000 onwards. | eternalban wrote: | This is something I planned (2015) on sharing at some point but | then years flew by and here we are .. :} | | It is a cacheline-sized 'container' (CLC) of machine-word-length | records, with one record used to store the record order and the | remaining bits for metadata. So you can project different kinds | of container semantics, such as FIFO or LRU -- any ordered set | semantics -- on this meta record. Using arrays of CLCs you can | create e.g. a segmented LRU, where the overhead per item is | substantially less than a conventional pointer-based data | structure; it is naturally suited for concurrent operations | (for example, by assigning a range to distinct worker threads), | and ops require only a couple of cache lines to be touched. The LRU | (or whatever) semantics in aggregate will be probabilistic, as | the LRU order is deterministic per unit container only. It is | very fast :) | | https://github.com/alphazero/libclc/blob/master/include/libc... | | https://github.com/alphazero/libclc/blob/master/include/libc... | | As for the non-deterministic aspects: Since container semantics, | e.g. LRU order, are only applied at unit level, the overall cache | is ~LRU. We can strictly quantify the 'ordering error' by | observing the age of items in FIFO mode as they are removed: for | a deterministic container the age of the item is equal to the | total capacity of the queue; for a segmented array composed of | FIFO queues, the age will have an effectively Gaussian | distribution around the capacity (number of units x unit | capacity). But since these containers can use as few as 9 bits | per entry vs 24 or more _bytes_ for pointer-based solutions | (which use linked lists), for the same allocation of memory, the | capacity of the array of CLCs will be much greater, so the | distribution tail of 'short-lived' items will actually be longer- | lived than items in a strict queue for the same exact memory.
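| (That age spread is easy to model -- a toy Python sketch,
| nothing to do with the actual libclc code: random assignment
| into an array of small FIFOs vs one strict FIFO of the same
| total capacity, recording each item's age at eviction:)
|
|     import random
|     from collections import deque
|
|     UNITS, SLOTS = 64, 16       # 64 unit FIFOs, total capacity 1024
|     units = [deque() for _ in range(UNITS)]
|     ages = []
|     for t in range(200_000):
|         u = units[random.randrange(UNITS)]  # hash -> unit, modeled as random
|         u.append(t)
|         if len(u) > SLOTS:
|             ages.append(t - u.popleft())    # age of the evicted item
|
|     ages = ages[10_000:]  # drop warmup
|     # A strict FIFO of capacity 1024 always evicts at age 1024; the
|     # segmented version spreads around that mean:
|     print(min(ages), sum(ages) // len(ages), max(ages))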
| Additional techniques, such as n-array hashing and low-order | 'clock' bits at container level, can tighten this distribution | significantly (i.e. ~LRU -> LRU) via better loading factors. | sanxiyn wrote: | Theoretical advances are constant (just check the literature), | but two particular advances in the last decade are of practical | importance: one for sorting and one for hash tables. As in: check the | implementation you are using; if it doesn't incorporate these | advances you are leaving substantial performance (2x is not out | of the question) on the table. | | How Branch Mispredictions don't affect Quicksort (2016) | https://arxiv.org/abs/1604.06697 | | Quicksort is still the algorithm of choice for general purpose | internal unstable sort. On modern hardware, quicksort spends a | lot of time recovering from branch misprediction, because | quicksort's comparison branches are inherently unpredictable. The | problem is severe to the extent that with branchful quicksort, an | uneven split with more recursion is faster than an even split, | because it makes branches more predictable! The exact penalty is | microarchitecture-specific, but one report has a 20:80 | split being optimal. | | So... you should use branchless quicksort! The technique was | pioneered by an implementation called BlockQuicksort, but pdqsort | (for pattern-defeating quicksort) | https://arxiv.org/abs/2106.05123 is also popular; pdqsort | incorporates the branchless technique and also includes other | enhancements. | | Designing a Fast, Efficient, Cache-friendly Hash Table, Step by | Step (2017) https://abseil.io/blog/20180927-swisstables | | The hash table is the workhorse of data structures, and while | improvements are possible for specific usage patterns, open | addressing with linear probing is still the fastest | implementation for general-purpose use. The so-called "Swiss Table" | combines a cache-friendly metadata layout with SIMD-enhanced | probing. These two optimizations work well with each other, and | the combination of the two is particularly effective. | _dain_ wrote: | _> pdqsort (for pattern-defeating quicksort)_ | | I'm sad that it's not "pretty darn quicksort" | sanxiyn wrote: | It probably is that, just not officially. You know, | officially, BFR stands for Big Falcon Rocket. | anonymoushn wrote: | For hit-heavy workloads, I have been unable to get Swiss tables | to compete with tables that hit only ~1 cache line per query. | hinkley wrote: | I recall years ago someone suggesting that a 1:2 split was | faster, and I'm wondering if they were seeing the same thing, | and either didn't follow it through to conclusion or the | goalposts keep moving in each generation of processors. | sanxiyn wrote: | It is plausible that they were seeing the same thing. | xiaodai wrote: | Cache-oblivious algorithms and lock-free data structures | Twisol wrote: | E-graphs are pretty awesome, and worth keeping in your back | pocket. They're like union-find structures, except they also | maintain congruence relations (i.e. if `x` and `y` are in the | same set, then `f(x)` and `f(y)` must likewise be in the same | set). | | https://egraphs-good.github.io/ | | (Incidentally, union-find structures are also great to know | about. But they're not exactly "new".) | giogadi wrote: | IMO the hip new thing in algorithms and data structures is to | Just Use Arrays For Everything (see the sketch just below) | dymk wrote: | Surely it's the interesting things built on top of the arrays? | yamazakiwi wrote: | Are you talking about Vector DBs?
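| (What that usually cashes out to is struct-of-arrays instead
| of array-of-structs: group each field contiguously so a scan
| over one field touches far fewer cache lines. A minimal
| sketch, with invented field names:)
|
|     import numpy as np
|
|     N = 100_000
|     # Array-of-structs: fields interleaved, one boxed record each
|     aos = [{"x": float(i), "y": 0.0, "hp": 100} for i in range(N)]
|
|     # Struct-of-arrays: one dense, cache-friendly array per field
|     xs = np.arange(N, dtype=np.float64)
|     ys = np.zeros(N)
|     hps = np.full(N, 100, dtype=np.int32)
|
|     # "Move everything right by 1": SoA touches only the x column...
|     xs += 1.0
|     for e in aos:       # ...while AoS chases 100k boxed dicts
|         e["x"] += 1.0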
| masklinn wrote: | Also ECS / columnar data? | bckr wrote: | Got a source? | CountHackulus wrote: | I'm not OP, but I'm guessing he could be referring to | something like this talk? | https://www.youtube.com/watch?v=LrVi9LHP8Bk | | I'm honestly not sure otherwise. | eternalban wrote: | When skirt lengths change and we observe no measurable change | in human anatomy, that's fashion, and to follow fashion is | "hip". | | When the computing platform changes and you have to start | paying attention to L-level caches and counting your bytes (so | you can do your big compute in memory), and use n cores, and | then arrays are adopted, that is not fashion; so not "hip", | rather _modern_. | transformi wrote: | Follow-up: anything in graph theory? | kzrdude wrote: | I'm not exactly up to date, but the last few years have seen | practical improvements to graph isomorphism algorithms. | bmitc wrote: | I'm interested in knowing about graph layout algorithms that | allow specific constraints. If anyone knows about these things, | please let me know. I do not, but I have a use case for them. | | For example, the constraints may be: | | 1) The graph is read left to right. This informs the layout such | that those spiral and circular graph layout algorithms that blow | out the graph do not work. | | 2) Try to keep certain types of nodes vertically aligned, so that | they lie on the same horizontal line. | | 3) Minimize edge crossings. | | 4) Have the edges be not lines but paths made up of horizontal | and vertical lines. This probably requires some pathfinding as | well. | | 5) Be able to live-update, in the sense that a new node or edge | entry into the graph does not drastically re-layout the graph for | reasonable inserts. | alejohausner wrote: | Look at Graphviz. Their "dot" program satisfies 1 through 4. | Your fifth wish is hard to satisfy. ___________________________________________________________________ (page generated 2023-05-10 23:00 UTC)