[HN Gopher] Building a Curve25519 Hardware Accelerator
       ___________________________________________________________________
        
       Building a Curve25519 Hardware Accelerator
        
       Author : parsecs
       Score  : 163 points
       Date   : 2021-07-15 14:37 UTC (8 hours ago)
        
 (HTM) web link (www.bunniestudios.com)
 (TXT) w3m dump (www.bunniestudios.com)
        
       | upofadown wrote:
       | >The "double ratchet" algorithm is integral to modern end-to-end-
       | encrypted chat apps, such as Signal, WhatsApp, and Matrix.
       | 
       | Signal protocol uses a hash ratchet for the message to message
       | forward secrecy. So you don't need to do the expensive key
       | agreement stuff unless that hash ratchet falls out of
       | synchronization.
       | 
       | It isn't really clear if any messaging application really needs
       | message to message forward secrecy in the first place. You really
       | only need to do that expensive key agreement operation when you
       | want to get rid of some old messages that someone has captured
       | off the wire. That can be reasonably done at the end of a session
       | or even weekly. Very few people need messages that have not even
       | scrolled off the screen to be forward secret, assuming they need
       | it at all.
        
         | tptacek wrote:
         | _You really only need to do that expensive key agreement
         | operation when you want to get rid of some old messages that
         | someone has captured off the wire._
         | 
         | Could you perhaps share more of your analysis here? I'd like to
         | be able to better follow what you are trying to say.
        
           | upofadown wrote:
           | If you are using a temporary shared secret then after you
           | forget that shared secret you will need a new secret which
           | you will have to generate. I mean nothing more than that.
        
             | tptacek wrote:
             | Right, why would anyone ever be better off holding that
             | temporary secret over many messages and many days?
        
               | upofadown wrote:
               | My entire point is that it is not really necessary to do
               | shared secret generation at any but a very slow rate.
               | Even if you don't so it the Signal protocol way it is
               | still perfectly reasonable to only do it once in a while.
               | That might be a short while but it could reasonably be
               | quite long.
        
               | tptacek wrote:
               | Why is it reasonable to only do shared secret generation
               | once in awhile? What's the win to doing that? The win to
               | not doing it is that it narrows the blast radius of a
               | single key compromise to a small number of messages (or,
               | in full ratchets, to a single message). Why would I ever,
               | ever want to _increase_ the blast radius of a key?
        
         | admax88q wrote:
         | I'm not sure you gain much by moving the forward secrecy to a
         | weekly or session based level, and I think you lose a lot in
         | terms of simplicity.
         | 
         | It is simpler to design and analyze if it applies at every
         | message. It's a design for the best case. No longer need to
         | make judgement calls on if 1 week is enough, or maybe 2, or
         | maybe some middle ground.
        
         | toast0 wrote:
         | > That can be reasonably done at the end of a session or even
         | weekly.
         | 
         | At the end of a session might be reasonable cryptographically,
         | but isn't really reasonable in terms of application
         | development; you don't consistently get to have use CPU and
         | network when a session ends and you may not get access again
         | until a new session begins. A weekly trigger is more likely to
         | be practical, but that may not work consistently either. Better
         | to do the ratchetting as you're doing the messaging to
         | guarantee it gets done.
        
         | sliken wrote:
         | Say it takes a $5,000 machine 24 hours to break the code.
         | Clearly it's much more attractive to an attacker if they can
         | break a weeks worth of messages than 10 seconds worth.
        
       | OJFord wrote:
       | > It's a weird thing about academics -- they like to write papers
       | and "share ideas", but it's very hard to get source code from
       | them.
       | 
       | I suppose at least part of the reason, especially for things like
       | crypto and statistics, is that they don't want to get bogged down
       | in plausibly-correct claims of fatal flaws, and be perhaps
       | repeatedly (over years hence) put in a defensive position of
       | having to prove the criticism _in_ correct or have everyone
       | assume the work was wrong?
       | 
       | I can sort of understand that. The same's true for the papers
       | themselves of course, but I can see it being more annoying for
       | code. ('But it's never going to actually be in that state', etc.)
        
         | JJMcJ wrote:
         | Donald Knuth once wrote a paper with some source code.
         | 
         | He warned his readers about just copying the code, perhaps half
         | jokingly.
         | 
         | He said, approximately, "this code has been proven correct, but
         | not tested."
        
         | bunnie wrote:
         | You're probably right. And assuming you are, then I think maybe
         | the core problem may be that academics even tolerate the petty
         | undercutting of reputation based not on the merit of the big
         | ideas, but rather minor flaws in the proofs of concept.
         | 
         | There's also a problem in allowing papers to get away with
         | making arbitrary claims that are by definition irreproducible.
         | Every paper always includes some section about how many gates a
         | design took up and how fast it ran -- but there aren't entirely
         | consistent standards for reporting these things; it's often too
         | easy to hide significant setup costs when running benchmarks.
         | 
         | This leads to an incentive to overstate results, because
         | without any threat of checks it would be disadvantageous to not
         | cherry-pick the absolute best numbers you could get through a
         | peer review committee.
         | 
         | I get the desire to escape petty criticism: but it seems like a
         | weak argument against the myriad of reproducibility and
         | integrity hazards such a lack of transparency begets,
         | especially for a line of work that ostensibly distinguishes
         | itself on high standards for merit and integrity.
         | 
         | It also would have been nice to just receive a polite response
         | of the form of "sorry, I don't support this anymore, I've moved
         | on..." from anyone I had reached out to. Before starting on
         | this project, I emailed four or five groups that had all
         | published various papers on Curve25519 accelerators hoping to
         | get even a broken non-compiling repo that could have helped me
         | sanity-check my understanding of the thornier concepts
         | presented in the papers.
         | 
         | Not even a single response...hence the motivation to write up
         | notes and share them. Seems a waste of funding to have so many
         | implementations, but no way to actually use any of them...
        
         | staticassertion wrote:
         | Are you saying this based on experience? My assumption was that
         | it was a number of factors.
         | 
         | 1. Researchers are often weak coders, and they may be self
         | conscious about their code.
         | 
         | 2. My own experience is that researcher code is totally
         | undocument, or broken. I've repeatedly found that the code
         | linked to from a paper has no straightforward way to be built,
         | or appears to have compiler errors etc.
         | 
         | 3. It's more work, and researchers have the incentive to put
         | out a compelling paper, not to put out code (because we don't
         | hold them to that standard).
         | 
         | My own experiences have led to me to mostly dismiss papers
         | where I can't find the code or can't get it to build.
        
           | kzrdude wrote:
           | Given all that, it's understandable that it's not released
           | and completely reasonable - a support structure is needed if
           | the code should be released.
        
             | staticassertion wrote:
             | I'm setting the bar super low. A `./configure` and a readme
             | with basic info about the project and how to build/run it
             | would be quite exceptional. I don't need them to be
             | triaging github issues and merging in PRs.
        
         | gww wrote:
         | In the biological sciences there is a huge push for the code to
         | be released alongside the paper. Many publications (but not
         | enough yet) include a complete pipeline in a workflow language
         | that downloads, processes the data and generates the figures
         | used in the paper.
        
           | toomuchtodo wrote:
           | What's the language de jour typically leveraged to declare
           | these workflows?
        
             | semi-extrinsic wrote:
             | Since GP mentions biology, I would guess R is the first
             | choice.
        
       | jlrubin wrote:
       | Hi Bunnie!
       | 
       | Curious if you also have a power usage comparison, which I'm
       | guessing is also relevant in the mobile context.
        
       | korethr wrote:
       | This is neat. The ever-present admonition against rolling one's
       | own crypto lurks in the back of my mind, though. It is not my
       | intent here to impugn Bunnie's competence, though I do wonder if
       | he's run the design or implementation past any cryptographers to
       | try to verify that the design or implementation are sound, and
       | don't subtly break critical assumptions vital to the security of
       | the algorithm.
        
         | pinewurst wrote:
         | But he's not rolling his own crypto - in an algorithm sense -
         | purely trying to speed the execution of an existing algorithm
         | which can be easily validated against other existing
         | implementations
        
           | korethr wrote:
           | Sure, but is it not possible that a hardware acceleration, if
           | not also proven correct, could subtly undermine an otherwise
           | valid software implementation?
           | 
           | I do not assert that this is the case generally, nor
           | particularly here -- I would be quite happy to be wrong about
           | this. But if there's one thing I've taken from my admittedly
           | minor study of crypto is that the most trivial of details can
           | be critically important, and failure to get a detail exactly
           | correct can compromise an entire algorithm and anything built
           | thereupon.
           | 
           | I would love to see hardware acceleration for ed25519, just
           | like there's been hardware acceleration for AES. It could
           | help drive adoption of an algorithm that is less fiddly and
           | easy to get wrong than RSA, and maybe in another 15 years,
           | I'd be able to purchase IT equipment that supports ed25519
           | and not just hardware accelerated RSA and AES.
        
       | IfOnlyYouKnew wrote:
       | Excellent deep dive on this beast that had me similarly question
       | my intelligence in the past.
       | 
       | As a former academic: the reason sharing source code is/was rare
       | is that it tends to be extremely ugly, and that you don't get any
       | credit for publishing it where it matters (I. e. tenure
       | committees). To go from something that works to something that
       | you would allow people to see takes about as much time as the
       | original work.
       | 
       | (As a side note: I feared for the worst when I saw, among tags
       | like "hacking" and "open source", the tag "feminism" in the side
       | bar. I enjoyed being wrong.)
        
         | vlovich123 wrote:
         | I don't think that's a good reason. Production code is also
         | ugly.
        
       | sowbug wrote:
       | This work is associated with Precursor
       | (https://www.bunniestudios.com/blog/?p=5921), which is Bunnie's
       | project to build a mobile hardware platform from the ground up.
        
       | jazzyjackson wrote:
       | TIL 255 is divisible by 17
       | 
       | Edit: thanks to my replies I know several more things now lol big
       | day for me
        
         | pavpanchekha wrote:
         | 256 = 16 * 16
         | 
         | 255 = 256 - 1 = 16 * 16 - 1 = 15 * 17
        
         | Sesse__ wrote:
         | 2^p - 1 can only be prime (a so-called Mersenne prime) if p is
         | prime. 2^8 - 1 thus cannot be a prime, since 8 is composite.
         | 
         | (Note that is it not _sufficient_ that p is prime, but it is
         | necessary.)
        
         | pfortuny wrote:
         | (a^2-1)=(a+1)(a-1).
        
         | jandrese wrote:
         | It is also divisible by 5.
        
       | ChuckMcM wrote:
       | This is an excellent read[1]. It reminded me of the process I
       | went through putting together an FFT engine in an FPGA although I
       | think the small execution engine is just stellar, and not
       | something I had considered.
       | 
       | [1] To be fair I find most of the stuff Bunnie writes about to be
       | worth reading! :-)
        
       | staticassertion wrote:
       | > Wait, what? So many articles and journals I read on the topic
       | just talk about prime fields, modular reduction and blah blah
       | blah like it's asking someone to buy milk and eggs at the grocery
       | store.
       | 
       | No kidding, it's so frustrating! I have to read some articles/
       | papers like 20x and open a bunch of wikipedia tabs to understand
       | wtf they're talking about. If they gave a simple, high level
       | explanation it would save tons of time - while a wikipedia
       | article is going to be very in depth, it's not tailored to the
       | context of what I'm reading, so I often have to look at a whole
       | bunch of things to start to build a good model in my head for wtf
       | they're trying to convey.
       | 
       | This post, on the other hand, is perfect for me. Thank you so
       | much for writing it I'm learning a ton.
        
         | dragontamer wrote:
         | https://www.youtube.com/watch?v=xE4jEKx9fTM
         | 
         | GF5 is the 3rd smallest finite field (GF2 and GF3 are smaller),
         | but GF2 and GF3 are too small to really see where the beauty of
         | all this math lies.
         | 
         | ------
         | 
         | GF5 (which is simply "arithmetic modulo 5") has all the
         | properties of Real-numbers "we care about", and none of the
         | hassles of decimals or partial numbers.
         | 
         | GF5 is simply arithmetic modulo 5. Its really easy to do. 4 + 2
         | == 1. (Aka: 4 + 2 mod 5 == 6 mod 5 == 1). Addition and
         | multiplication are easily proven to be the same.
         | 
         | The fun starts when you start doing roots, exponents and
         | logarithms (!!!). Take the definition of addition /
         | multiplication above (aka: addition / multiplication modulo 5)
         | and redefine square-root, exponents, and logarithms to be
         | consistent with it.
         | 
         | That is to say: 2 * 3 == 1. Which means that 3 == (1/2) (yes,
         | "one half" is now 3 in GF5). Which means 3 is 2^(-1).
         | 
         | * 2^2 == 4.
         | 
         | * 2^2 * (2^-1) == 4 * 3 == 12 mod 5 == 2.
         | 
         | Which means 2^2 * 2^-1 == 2^(2-1) and look, exponents work as
         | expected. Feel free to play around with all the numbers: you'll
         | find that exponents work once you match this new "definition"
         | of exponents.
         | 
         | ---------
         | 
         | Because exponents work, logarithms work. Because everything
         | works, we can have complicated ellipical curves defined over
         | the numbers {0, 1, 2, 3, 4} (aka: the numbers mod 5) and have
         | everything work out mathematically.
         | 
         | --------
         | 
         | The hard part is proving that all of this works in GF(5^2) aka
         | GF25. It turns out that you can't just "mod 25" the stuff, you
         | need to do an extension field (and basically make a complex
         | number out of things).
         | 
         | Once you understand GF25 extension fields and its relationship
         | with GF5, then it becomes easy to generalize that to GF2 and
         | its extension fields GF(2^8) (aka: binary numbers of 8 bits) or
         | GF(2^32) (aka: 32-bit numbers).
         | 
         | Turns out for extension fields (like GF(5^2) aka GF25) you need
         | to redefine multiplication and addition as well and then
         | rebuild all the math from the ground up off of this new
         | definition of addition and new definition of multiplication.
         | And again, this is most important for GF(2^32), aka the 32-bit
         | number field that's common on today's computers.
         | 
         | Bonus points: GF(2^32)'s new addition operator is just an xor.
         | Only multiplication is difficult.
         | 
         | But once you do that, it all __JUST WORKS__ and is amazing.
         | 
         | ----------
         | 
         | Galois Fields are a way to "cheat" a definition of logarithm,
         | exponents, square roots, multiplication, division, and addition
         | (by redefining these operations).
         | 
         | It turns out that these operations continue to work "as
         | expected" in the Galois Field... as long as you're cool with
         | the new definitions of them.
         | 
         | ------
         | 
         | EDIT: I'm slightly wrong. Seems like Curve25519 is a prime
         | field and not an extension field. So this is more similar to
         | GF5 than it is to GF(2^32). Still, other elliptical curves are
         | in GF(2^32) or other extension fields, so the study of
         | extension fields is useful for other curves.
         | 
         | Curve25519 can be understood with just the GF(5) description I
         | gave earlier in this post. Prime fields are really simple: just
         | mod(the prime number) and you're set. There's some weird
         | properties that can accelerate speed (Any division by Foo is
         | equivalent to multiplying by (1/Foo), which might be faster).
        
           | fogof wrote:
           | > GF5 is the 3rd smallest finite field (GF2 and GF3 are
           | smaller)
           | 
           | There is also a field of 4 elements, since 4 is a power of a
           | prime.
        
             | dragontamer wrote:
             | Fair. I guess I was implicitly staying away from extension
             | fields because people's attention span will start drifing
             | away the minute I start to attempt to explain how to
             | perform modulo polynomial arithmetic over irreducible
             | polynomials.
             | 
             | Prime fields (GF2, GF3, GF5...) are just easier to explain
             | than extension fields (GF (2^2), GF(3^2), GF(5^2)...
             | GF(2^8)...)
        
         | Kalium wrote:
         | This feels, to me, like the complaint of someone with a general
         | technical grounding going into a narrow specialization and
         | discovering they understand none of the terminology.
         | 
         | This is _exactly_ how I felt when I was learning Haskell and
         | hearing about eta conversions and pointfreeness and so on. Each
         | new bit of terminology had a precise definition, but
         | specialists writing primarily for one another were not inclined
         | to offer definitions each time. They didn 't need them and
         | neither did their anticipated audience. I had to learn the
         | terminology and jargon so I could understand what they were
         | talking about, and high-level abstracted explanations were not
         | going to be enough to participate usefully. I had to learn the
         | basics.
         | 
         | Similarly, Watson & Crick's seminal paper might not have been
         | the best of all possible times to stop and define an X-ray or
         | how crystallography worked.
         | 
         | So yes, just instantiate the metaclass, implement the
         | interface, and go buy milk. Easy. You just have to learn what
         | each of those things actually means, especially if the details
         | might be important (and in cryptography, that's always the
         | case).
        
         | bunnie wrote:
         | Thanks! It's good to know I'm not alone in feeling this way.
         | 
         | This is actually an abridged version, if you click into the
         | link for the "datasheet" (https://ci.betrusted.io/betrusted-
         | soc/doc/engine.html) you can find all the gory details down to
         | the register level...
        
           | staticassertion wrote:
           | Awesome, much appreciated, I'll dig into it. I was actually
           | just using the dalek crate myself quite heavily this week, so
           | this is great timing.
        
           | amirhirsch wrote:
           | I liked your struggle with the corner-case :)
        
       ___________________________________________________________________
       (page generated 2021-07-15 23:00 UTC)