[HN Gopher] Parsix: Parse Don't Validate
       Parsix: Parse Don't Validate
       Author : Iazel
       Score  : 105 points
       Date   : 2021-05-15 15:38 UTC (7 hours ago)
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
       | ledauphin wrote:
       | I've been looking for a solid Typescript implementation of "parse
       | don't validate" that performs runtime parsing using semantics
       | attached to the defined Typescript types themselves. In other
       | words, much like attrs for Python, I want to be able to define a
       | low/no-boilerplate type, and then register parsers for those
       | types that will work recursively to parse my data, resulting in
       | the specified Typescript type.
       | Has anyone seen or written something like this?
         | iddan wrote:
         | It is definitely possible as Flowtype got it right. I hope one
         | day it will come to TypeScript as well
         | brundolf wrote:
         | Use io-ts: https://github.com/gcanti/io-ts
         | You define a decoder schema, and then the resulting TS type
         | gets automatically derived for you. You can then run data
         | through the decoder, it will err if there's a mismatch, or
         | return a value of the inferred type otherwise.
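To make the "define the schema once, derive the type from it" idea concrete, here is a minimal hand-rolled sketch. It is not the io-ts API: `Result`, `Decoder`, `TypeOf`, `str`, `num`, and `struct` are all hypothetical names invented for this illustration, and a real library handles far more cases (optional fields, unions, error paths).

```typescript
// Hand-rolled sketch; every name here is hypothetical and this is NOT the io-ts API.
type Result<T> = { ok: true; value: T } | { ok: false; error: string };

interface Decoder<T> {
  decode(input: unknown): Result<T>;
}

// Derives the static type from a decoder, so the shape is written only once.
type TypeOf<D> = D extends Decoder<infer T> ? T : never;

const str: Decoder<string> = {
  decode: (input) =>
    typeof input === "string"
      ? { ok: true, value: input }
      : { ok: false, error: `expected string, got ${typeof input}` },
};

const num: Decoder<number> = {
  decode: (input) =>
    typeof input === "number"
      ? { ok: true, value: input }
      : { ok: false, error: `expected number, got ${typeof input}` },
};

// Builds an object decoder out of a record of field decoders.
function struct<S extends Record<string, Decoder<unknown>>>(
  shape: S
): Decoder<{ [K in keyof S]: TypeOf<S[K]> }> {
  return {
    decode(input) {
      if (typeof input !== "object" || input === null) {
        return { ok: false, error: "expected object" };
      }
      const source = input as Record<string, unknown>;
      const out: Record<string, unknown> = {};
      for (const key in shape) {
        const field = shape[key].decode(source[key]);
        if (!field.ok) return { ok: false, error: `${key}: ${field.error}` };
        out[key] = field.value;
      }
      // The cast is justified because every field was just decoded above.
      return { ok: true, value: out as unknown as { [K in keyof S]: TypeOf<S[K]> } };
    },
  };
}

// The schema is the single source of truth; the type is derived from it.
const User = struct({ name: str, age: num });
type User = TypeOf<typeof User>; // inferred as { name: string; age: number }
```

Changing the `struct({...})` call changes the `User` type with it, so the rest of the code is type-checked against the schema without a second, hand-written interface.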
         | bmuon wrote:
         | I've been using this small library inspired by Elm/Swift
         | decoders [1]. It works, but it's not low boilerplate.
         | I'm gravitating towards GraphQL now because strict parsing is
         | built into it, so there is no need for all this boilerplate.
         | [1] https://www.npmjs.com/package/@mojotech/json-type-validation
           | ledauphin wrote:
           | we use GraphQL for this purpose as well, but I'd also like to
           | be able to validate across other boundaries.
           | However, as I'm saying this, I wonder if I've been looking at
           | this problem wrong. Since we already generate types from
           | GraphQL schemas, maybe I should figure out how to use the
           | same client side parser that's already in my GraphQL client,
           | define a GraphQL schema for the types I'm interested in, and
           | then just generate and use those types.
           | One thing that doesn't necessarily give me is the ability to
           | define custom parsers corresponding to custom types. At
           | least, I think most of that sort of thing is usually done
           | server side with GraphQL.
           | So, thank you for the link and also the inspiration for
           | considering an alternative.
         | renke1 wrote:
         | Not exactly what you want, I think, but there is zod [0].
         | I really would like to see nominal typing support in
         | TypeScript. Currently, it's hard to validate a piece of data
         | (or parse for that matter) once and have other functions only
         | operate on that validated data. There are (ugly?) workarounds
         | though [1].
         | [0]: https://github.com/colinhacks/zod
         | [1]: https://gist.github.com/dcolthorp/aa21cf87d847ae9942106435bf...
       | brundolf wrote:
       | Similar thing for TypeScript: https://github.com/gcanti/io-ts
       | billytetrud wrote:
       | To me this just looks like they're arguing for using class types
       | rather than raw strings. The parsing seems kind of orthogonal and
       | a special case of the kinds of validation you might want to do.
       | It's also misleading in that the code is still doing validation,
       | just in a different place.
         | Iazel wrote:
         | Yes, it is basically about proving that some data has been
         | validated by encoding this proof in a specific type, like
         | Email :) We want to popularize this idea and make it easier to
         | work with it by offering some nice, type-safe abstraction.
         | didibus wrote:
         | Yeah, but I think it's even more so, they're arguing that you
         | should model the fact that something has been validated or not,
         | and functions should indicate if they expect a validated form
         | of input or not.
         | In that sense, using types is only one way to do this, but you
         | could model that in other ways. For example:
         |     var foo = "123"
         |     foo = validFoo(foo)
         |     print(foo)
         |     > {"value" : "123",
         |        "valid?" : true}
         | And now you could have:
         |     function bar(validFoo) {
         |       if (!validFoo.get("valid?"))
         |         throw new InvalidInputException(
         |           "Foo must be validated prior to calling bar.")
         |       ...
         |     }
         | Now types are a convenient way to do this that also gives you
         | static checking for it, but I believe the idea is more to model
         | that things were validated, and to expect validated input or
         | fail.
         | That allows you to push all validation at the boundary, and
         | make sure that no one ever forgets to validate the input,
         | because if they do, the inner functions will fail reminding the
         | caller: Please remember to validate this!
           | billytetrud wrote:
           | Makes sense. It just seems like parsing is kind of a separate
           | issue and shouldn't be entangled with the concept of input
           | validation.
         | alserio wrote:
         | I mean, yes, the point is that it is a better place to do the
         | validation step. Also, "parse" is used generically here to mean
         | going from one representation to a more structured one.
       | samatman wrote:
       | As a minor point of order, the exact phrase "parse, don't
       | validate" has been conventional wisdom in langsec circles since I
       | got involved, so 2014 at the earliest.
       | I asked around on the work Matrix as to who actually coined it,
       | but it's the weekend.
       | This is not to take anything away from @lexi_lambda, who cited
       | her sources and documented an interesting type-theoretic approach
       | to applying the principle. She did a great job!
       | If anyone wants to do a deeper dive, look into langsec, language-
       | theoretic security. There's a lot of prior art to explore.
       | alex_duf wrote:
       | I think this can be summarised by "model your domain by using
       | types, then let the compiler ensure you're not doing anything
       | silly"
       | Waterluvian wrote:
       | I developed a pattern in typescript (I'm sure it's not original)
       | where I have an interface describing an API entity and a class of
       | the same name with only static methods, one of which is
       | Foo.fromApi() that validates and parses.
       | I haven't seen any need to bring a library in to handle this.
       | Though it would be nice to marry the worlds of TS, API, and Json
       | Schema.
         | lhnz wrote:
         | It doesn't use json schema but you might be interested in
         | something like this: https://gcanti.github.io/io-ts/
         | (You can define runtime encoder/decoders which produce typed
         | values.)
           | brundolf wrote:
           | io-ts is fantastic (I linked it myself above). The killer
           | feature is that it infers the static types of your runtime
           | schemas for you, so you don't have to define them twice. You
           | make a change to the schema, the rest of your code will
           | typecheck against it.
       | smnrchrds wrote:
       | Fun fact: Parsix was the name of a Linux distro optimized for
       | Persian speakers.
       | didibus wrote:
       | I have to be honest, I'm not seeing what problem this is trying
       | to solve. Can anyone enlighten me?
       | Edit: Ok I think I understand...
       | It seems the problem arises when you're implementing a function
       | that takes a user email as a string, and that function is in a
       | lower layer of the application, like inside the data access
       | layer. At that point it is difficult to know whether the email
       | string you are passed has already been validated. Thus you might
       | be tempted to re-implement validation inside this function as
       | well, with an assertValidEmail check.
       | This can lead to a littering of validation throughout the code
       | base, as each implemented function worries that the input isn't
       | validated and re-validates it, possibly using slightly different
       | rules each time.
       | Furthermore, if you decide to not validate it again, you might be
       | left wondering, but am I sure it'll have been validated prior?
       | How can I be sure? Someone in the future could easily start
       | calling my function and forget to validate the email before
       | calling it? This could eventually lead to a security issue or
       | just a bug, by introducing a code path that doesn't ever validate
       | the email string.
       | Thus if instead you'd re-write your function so it takes an email
       | as a ValidEmail type (or object), and not as a string, you force
       | the caller to for sure remember to validate the email first. And
       | you also can safely assume if you're getting an email as a
       | ValidEmail type that it has been validated. It could also
       | technically allow you to localize the validation logic to the
       | ValidEmail type constructor, avoiding possible duplicate attempts
       | at validating email with different rules.
       | And it seems the latter "style" the author calls "Parsing" while
       | the former they call "Validating", in the sense that since the
       | function validating returns a modified structure it "parsed" it,
       | because a string became a ValidEmail, thus parsing a string into
       | a ValidEmail, as opposed to simply validating that the string is
       | valid as an email.
       | And finally, this is a little library to help make use of this
       | pattern in Kotlin.
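The ValidEmail idea described above can be sketched in a few lines. This is an illustration in TypeScript rather than the library's Kotlin API; `ValidEmail`, `saveUserEmail`, and the deliberately simplistic regular expression are assumptions made for the example, not real email validation rules.

```typescript
class ValidEmail {
  // A private field makes the class behave nominally in TypeScript, so a
  // plain object literal like { value: "x" } cannot impersonate a ValidEmail.
  private readonly validated = true;

  // Private constructor: parse() is the only way to obtain an instance.
  private constructor(readonly value: string) {}

  // Simplistic check, assumed for this sketch only (something@something.tld).
  static parse(raw: string): ValidEmail | null {
    const trimmed = raw.trim();
    return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(trimmed)
      ? new ValidEmail(trimmed)
      : null;
  }
}

// Hypothetical lower-layer function: it cannot be called with a raw string,
// so it never needs its own assertValidEmail check.
function saveUserEmail(email: ValidEmail): string {
  return `saved ${email.value}`;
}
```

Calling `saveUserEmail("bob@example.com")` is a compile error; the caller has to go through `ValidEmail.parse` first and handle the `null` case, which is exactly the reminder the comment describes.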
         | jinwoo68 wrote:
         | As they said in README, it's inspired by Alex King's Parse,
         | don't validate [1].
         | Basically, rather than write a validation function, write a
         | parser that returns a result of a specific type and use that
         | type everywhere else. Then you can make sure the raw inputs are
         | always validated.
         | [1] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-
         | va...
           | anaerobicover wrote:
           | Small correction: the author is named Alex _is_
             | jinwoo68 wrote:
             | Whoops. Thanks for correcting.
         | steventhedev wrote:
         | There's an entire class of vulnerabilities caused by having
         | separate verification and parsing logic, typically involving
         | fields where only one instance is normally present but the
         | format allows several. The verifier checks the first one but
         | the parser uses the last one.
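A toy reproduction of that first-versus-last mismatch, followed by a parse-once alternative. The query format, the `role` field, and every function name here are hypothetical, chosen only to make the failure mode visible.

```typescript
// Splits "k=v&k=v" into pairs, keeping duplicates in order.
function pairs(query: string): Array<[string, string]> {
  return query
    .split("&")
    .filter((part) => part.length > 0)
    .map((part): [string, string] => {
      const i = part.indexOf("=");
      return i < 0 ? [part, ""] : [part.slice(0, i), part.slice(i + 1)];
    });
}

// Validate-then-use: the verifier inspects the FIRST "role" field...
function looksSafe(query: string): boolean {
  const first = pairs(query).find(([key]) => key === "role");
  return first !== undefined && first[1] === "user";
}

// ...while the consumer independently reads the LAST one.
function roleUsed(query: string): string | undefined {
  const all = pairs(query).filter(([key]) => key === "role");
  const last = all[all.length - 1];
  return last ? last[1] : undefined;
}

// Parse once: a single function decides what the input means, and ambiguous
// input is rejected rather than resolved two different ways.
type Role = "user" | "admin";
interface ParsedQuery {
  readonly role: Role;
}

function parseQuery(query: string): ParsedQuery | null {
  const roles = pairs(query)
    .filter(([key]) => key === "role")
    .map(([, value]) => value);
  if (roles.length !== 1) return null;
  const role = roles[0];
  return role === "user" || role === "admin" ? { role } : null;
}
```

With the input `role=user&role=admin`, `looksSafe` approves the request while `roleUsed` hands back `admin`; `parseQuery` returns `null` for the same input, and downstream code only ever sees a `ParsedQuery`.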
         | Smaug123 wrote:
         | If you mean "why Parse, Don't Validate", you should read the
         | original blog post, linked at the top of the article. It's...
         | transformative, if you aren't already aware of the principle.
         | https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
         | If you mean "why this library", well, I guess parser
         | combinators are nice! Some may say that a declarative statement
         | of the parsing restrictions is better than a procedural
         | implementation, on general principles.
         | finnh wrote:
         | It encourages people to use strongly typed classes rather than
         | primitives, even if the type simply wraps a primitive.
         | As a result you can't pass an invalid (say) accountID deep into
         | your code, because validity is guaranteed to be checked early
         | when you "parse" an input string into the "AccountId" type.
         | So: internal interfaces defined using non-primitive types, so
         | internal methods don't need to keep validating their input.
         | Conversion to said types happens early and predictably,
         | catching bad values before they (eg) hit the database.
         | coderintherye wrote:
         | The linked blog post explains it pretty well. Essentially, it
         | seems to be solving for unexpected cases or incorrect
         | validation by using static typing and passing the expected
         | type back in the return rather than a boolean. I'm not sure
         | I've encountered enough issues with validation functions to use
         | this pattern, but it does seem like a more robust way of
         | writing them.
         | jpeloquin wrote:
         | Paraphrased from the repo's readme: Suppose you have a program
         | that consumes user input. Users often give bad input so the
         | program needs to validate user input before acting on it. One
         | way to validate is to call a function (e.g., `check_input`) on
         | the user input and if it doesn't raise an error the input is
         | safe for consumption by the rest of the program. The repo
         | author considers this approach to be risky because the
         | programmer can inadvertently omit or bypass `check_input` and
         | the program still compiles and runs without complaint.
         | The repo presents an alternative validation approach, which is
         | to parse the user input into a data type (or, not quite
         | equivalently, into a class). The parsing process serves as
         | validation. Consumer functions are written such that they only
         | accept the parsed data type. Therefore it is now impossible for
         | the programmer to inadvertently omit or bypass validation of
         | user input.
         | The library is a set of convenience functions for actually
         | writing these parsing / validation functions.
           | _jal wrote:
           | Reminds me a little of taint checking in Perl and Ruby, in
           | reverse.
           | atoav wrote:
           | So in short: instead of representing user input (e.g. an
           | email address) as a string - which you can forget to validate
           | - the idea here is to create a dedicated data type for it,
           | and use the validation step to create said data type.
           | The rest of your program then works with this data type
           | instead of the string and this way you will get a type error
           | whenever you accidentally use unvalidated data.
           | A nice idea that goes into a similar direction is to expand
           | on this and create more types for different levels of trust.
           | E.g. you could have the data types ValidatedEmail,
           | VerifiedEmail and TrustedEmail and define precisely how one
           | becomes the other. This way your type system will already
           | tell you what is valid and what is not, and you can't
           | accidentally mix them up.
             | TeMPOraL wrote:
             | You can also further generalize this idea by noticing you
             | can encode all kinds of life cycle information in your type
             | system. As you transform some data in a sequence of steps,
             | you can use types to document and enforce that the steps
             | are always executed in order.
             | In this example, the user input validation step is
             | f(String) -> ValidatedEmail, then the process of verifying
             | it is f(ValidatedEmail) -> VerifiedEmail. But the same
             | principle can apply to e.g. append() operation being
             | f(List[T], T) -> NonEmptyList[T], and you can write code
             | accepting NonEmptyList to save yourself an emptiness check.
             | Or, take a multi-step algorithm that gets a list of users,
             | filters them by some criterion, sorts the list, and sends
             | these users e-mails. Type-wise, it's a flow of Users ->
             | EligibleUsers -> SortedEligibleUsers ->
             | ContactedEligibleUsers.
             | And then, why should types be singular anyway? You should
             | be able to tag data with properties, and then filter on or
             | transform a subtag of them. This is the area of theory I'm
             | not familiar with yet, but I imagine you _should_ be able
             | to do things like:
             | List[User] -> List[User, NonEmpty] -> List[User[Eligible],
             | NonEmpty] -> List[User[Eligible], NonEmpty, Sorted[Asc]] ->
             | List[User[Contacted], Sorted[Asc]].
             | Or,
             | Email -> Email[Validated] -> Email[Validated, Verified] ->
             | Email[Validated, Verified, Trusted].
             | I'm sure there's a programming language that does that, and
             | then there's probably lots of reasons that this doesn't
             | work in practice. I'd love to know about them, as I haven't
             | encountered anything like it in practice, except bits and
             | pieces of compiler code that can sometimes propagate such
             | information "in the background", for optimization and
             | correctness checking.
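One way to approximate the `Email[Validated, Verified]` notation in a mainstream language is to accumulate tags with intersection types. The sketch below is TypeScript; the tag names, the `@` check, and the `codeMatches` flag are assumptions standing in for real validation and verification steps, and the tags exist only at compile time.

```typescript
// Each tag is a distinct phantom property; intersecting them accumulates tags.
type Validated = { readonly __validated: true };
type Verified = { readonly __verified: true };

// An Email is a plain string at runtime; the tags live only in the type.
type Email<Tags = unknown> = string & Tags;

// String -> Email[Validated]. The "@" check is a placeholder rule.
function validate(raw: string): Email<Validated> | null {
  return raw.includes("@") ? (raw as Email<Validated>) : null;
}

// Email[T] -> Email[T, Verified]: adds a tag and keeps the existing ones.
// Requiring T extends Validated enforces that validation happened first.
function verify<T extends Validated>(
  email: Email<T>,
  codeMatches: boolean
): Email<T & Verified> | null {
  return codeMatches ? (email as Email<T & Verified>) : null;
}

// Only accepts emails that carry both tags.
function sendNewsletter(to: Email<Validated & Verified>): string {
  return `sent to ${to}`;
}
```

Passing a raw string, or an email that is only `Validated`, to `sendNewsletter` fails to compile. A known limitation of this encoding is that the casts inside `validate` and `verify` are trusted: nothing stops other code from writing the same cast, so the guarantee is only as strong as the team's discipline about where casts are allowed.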
           | _greim_ wrote:
           | To keep building on this, I think the word "parsing" is just
           | the tip of the iceberg. Parsing is one way to port data
           | across a type boundary, where the source and dest types are
           | optimized for different use cases (e.g. serialization vs
           | type-safe representation). Since the semantic Venn diagrams
           | of any two types might have areas of non-overlap, parse-
           | don't-validate means establishing clear boundaries in your
           | program where those translations happen, then defining the
           | types on either side of the boundary to rule out the
           | possibility of nonsense states elsewhere throughout the
           | program. The idea of nonsense states is closely related and
           | discussed more here[0] and here[1].
           | [0] http://blog.jenkster.com/2016/06/how-elm-slays-a-ui-
           | antipatt...
           | [1] https://kentcdodds.com/blog/make-impossible-states-
           | impossibl...
         | StreamBright wrote:
         | Interesting naming. Strongly typed languages (especially in the
         | ML family) have best practices that include using types instead
         | of strings as function parameters. Email type itself is enough
         | to skip validation in each function accepting that particular
         | type.
         | I think this is a great first step using functional languages but
         | you can go much much deeper than that.
         | https://www.slideshare.net/ScottWlaschin/the-power-of-compos...
         | cle wrote:
         | There are lots of siblings explaining why "parse don't
         | validate".
         | But also, it's not always wise to take this to an extreme. I've
         | seen over the years many scenarios where dev teams were over-
         | enthusiastic about this and parsed themselves into a corner by
         | making system components over-strict and enforcing invariants
         | that weren't necessary to enforce, making them much harder to
         | change later.
         | The right answer is, of course, somewhere in the middle, and
         | depends on your domain and situation.
           | Iazel wrote:
           | hi, @cle! Curious to hear more about that, were they actually
           | running validation/assertions in constructors?
             | cle wrote:
             | That can be a case of that yeah. Using your example, a lot
             | of devs might use that email parsing logic in various
             | independent components of the same system. Eg if you have a
             | reporting component that sends you business reports, that
             | component really shouldn't be validating the structure of
             | email addresses...if you need to refine the parsing logic
             | now you've got to do coordinated deployments, possibly
             | backfills, etc., whereas if you just treated it as an
             | opaque string in that system you'd be better off.
             | This isn't really a criticism of the approach, it's super
             | useful, just that it needs to be applied judiciously.
             | "Parse all the things" isn't always the best advice.
         | Iazel wrote:
         | Cool to see you perfectly got the point in the end! I wonder
         | though, were you confused by the README? What made it clear for
         | you?
           | didibus wrote:
           | Hum, it was the people here who replied to my question, and
           | also reading the linked article.
           | I think my confusion came from trying to frame things as
           | parsing vs. validating. I appreciate that use of the word now
           | that I understand it, but it was also my biggest source of
           | confusion.
           | That's because I think most people think of parsing as
           | conversion, like turning a String into an Int. Whereas in your
           | case, you're simply wanting to tag a type as having been
           | validated, but you don't really convert the type itself, so
           | you simply wrap it in another type in order to tag it as
           | having been validated simply because the language offers no
           | other way to tag the type with meta-information for the
           | compiler to assert statically.
           | So because it seemed more like you're just wrapping the
           | input, but still all code will be using the input value as it
           | is, extracting it out of your wrapped type, the idea that you
           | were "Parsing" and not "Validating" well just confused me.
         | mirekrusin wrote:
         | Imagine you're writing a TypeScript project. You type
         | everything and have type safety. This type safety is an
         | illusion at I/O boundaries, e.g. whenever JSON.parse(...) is
         | called on data from a file, websocket, or HTTP response. To
         | preserve type safety, you want to use something like [0] to do
         | runtime type assertions. Once the I/O boundaries parse unknown
         | values at runtime into what is defined as static types, your
         | type safety holds for the rest of the program.
         | [0] https://github.com/appliedblockchain/assert-combinators
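A small sketch of that boundary, written without any library (so it does not show the assert-combinators API). The `Message` shape and `parseMessage` are hypothetical; the point is only that the result of `JSON.parse` is treated as `unknown` and checked before it is given a static type.

```typescript
interface Message {
  readonly id: number;
  readonly text: string;
}

// Boundary function: raw text in, checked Message out (or an exception).
function parseMessage(json: string): Message {
  // JSON.parse is typed as returning `any`; annotating the result as
  // `unknown` forces every property to be checked before use.
  // Malformed JSON makes JSON.parse itself throw a SyntaxError.
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new TypeError("expected an object");
  }
  const { id, text } = data as { id?: unknown; text?: unknown };
  if (typeof id !== "number" || !Number.isInteger(id)) {
    throw new TypeError("id must be an integer");
  }
  if (typeof text !== "string") {
    throw new TypeError("text must be a string");
  }
  // Both fields are narrowed here, so no further cast is needed.
  return { id, text };
}
```

Everything past this function can rely on the `Message` type; without the checks, `JSON.parse(json) as Message` would compile equally well and fail much later.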
         | rdedev wrote:
         | I find this approach combined with phantom data types really
         | cool. Now you can easily introduce a semantic differentiation
         | between two instances of the same data type but without much
         | overhead
         | GordonS wrote:
         | If it helps, here's a related blog post but with a C# slant:
         | https://andrewlock.net/using-strongly-typed-entity-ids-to-av...
         | The author refers to using primitives everywhere as "primitive
         | obsession", and proposed using types instead.
           | dmux wrote:
           | Similar to the idea of "microtypes" (I've most often seen it
           | used in Java circles):
           | https://www.markhneedham.com/blog/2009/03/10/oo-micro-types/
         | matheusmoreira wrote:
         | This also has security implications. The input handling layer
         | is critical. Bugs in parsing and validation code are
         | responsible for a huge number of vulnerabilities.
         | More details: http://langsec.org/
         | > The Language-theoretic approach (LANGSEC) regards the
         | Internet insecurity epidemic as a consequence of _ad hoc_
         | programming of input handling at all layers of network stacks,
         | and in other kinds of software stacks.
         | > LANGSEC posits that the only path to trustworthy software
         | that takes untrusted inputs is treating all valid or expected
         | inputs as a formal language, and the respective input-handling
         | routines as a _recognizer_ for that language.
         | TheAceOfHearts wrote:
         | Refining types so they encode all desired constraints before
         | use. This is explained in the linked article: Parse, don't
         | validate [0].
         | It helps reduce the risk of using invalid inputs by
         | representing constraints over the value as part of the type.
         | For example: a common problem in web development security is
         | that query parameters aren't properly validated which can lead
         | to denial of service attacks. As a trivial example of this,
         | consider a web server which paginates some data using "offset"
         | and "limit" by passing those parameters directly to a database
         | query; an attacker could set "limit" to some incredibly high
         | value and cause the server to crash. If you're just doing
         | validation on your inputs it's possible that some usage could
         | end up being overlooked.
         | [0] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-
         | va...
           | gregors wrote:
           | So real question - in the "offset" "limit" example what makes
           | it any more safe if at first the programmer sets those types
           | to be integers? The same problem persists does it not?
           | Does the explicit creation of a type add this introspection?
           | I'm not convinced that it does. Now once you fix this bug,
           | encoding it in a type prevents it from creeping into other
           | parts of the code. This seems more like DRY principles in
           | action.
             | TheAceOfHearts wrote:
             | Apologies if I did a poor job of explaining, what you wrote
             | seems in agreement with what I was attempting to convey.
             | If one were only using integer types then the same problem
             | would persist, that's correct. The problem would be solved
             | by defining our limit type to only represent positive
             | integers up to a specific safe value.
             | Type refinement is done on the input boundaries of the
             | system during runtime to prevent errors from propagating.
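A minimal sketch of such a refined limit type in TypeScript. The ceiling of 100, the `Limit` brand, and `describePage` are assumptions for illustration; the right bound depends on the database and the workload.

```typescript
// A number that has been proven to lie in [1, MAX_LIMIT].
type Limit = number & { readonly __brand: "Limit" };

// Assumed safe ceiling for this sketch; pick a value that suits your system.
const MAX_LIMIT = 100;

// Refinement happens once, at the input boundary.
function parseLimit(raw: string): Limit | null {
  if (!/^\d+$/.test(raw)) return null; // digits only: no sign, no decimals
  const n = Number(raw);
  return n >= 1 && n <= MAX_LIMIT ? (n as Limit) : null;
}

// Hypothetical data-access function: a plain number is not accepted here,
// so an unchecked "limit" from the request cannot reach the query.
function describePage(offset: number, limit: Limit): string {
  return `rows ${offset + 1} to ${offset + limit}`;
}
```

As noted in the reply above, the type does not decide what a safe range is. Someone still has to choose `MAX_LIMIT`; the type only ensures the choice is applied on every path into `describePage`.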
             | didibus wrote:
             | Yeah, it seems to be more about guarantees as a code base
             | grows larger and more people touch it.
             | If there's a Limit class whose constructor and setter all
             | check that the range is between say 5 to 100, and all
             | existing code that needs the limit uses the instance of
             | Limit, it just becomes less likely a code change is made
             | that uses the limit input as it was directly provided by
             | the user (and thus possibly out of range).
             | But you'd still need to have had someone be smart enough to
             | make sure the Limit class does prevent limits that could
             | cause DB crashes.
             | In practice I'm thinking, ok, so someone must have
             | thought... Hey we should validate this user input and put
             | in some logic for it.
             | So I think what this says is, validation works by having
             | all external input validated as they are received. But it
             | can be easy to make a code change at the boundary where you
             | forget to add proper validation. If all existing functions
             | in the lower layers, like in the data access layer, are
             | designed to take a Limit object, the person who took a
             | limit as external input and was about to pass it to the
             | query function will get a compile error and realize... Oh I
             | need to first parse my integer limit into a Limit, and thus
             | reminds them to use the thing that enforces the valid
             | range.
             | If instead the code had a util function called
             | assertValidLimit, and the query function took a limit as an
             | integer, it would be easy for that person to forget to add
             | a call to assertValidLimit when getting the limit from the
             | user, then pass it unvalidated to the query and possibly
             | cause a vulnerability.
             | And lastly, it seems they argue against validating in the
             | query function itself. That way it wouldn't matter if
             | others forgot to validate, since the check happens where it
             | matters, but it is hard to fail cleanly at that layer: you
             | might have already made other changes, and that can leave
             | your state corrupted.
             | So basically it seems the argument is:
             | "It is best to validate external input at the boundary as
             | soon as it is received, but it can be easy to forget to do
             | so and that's dangerous. So to help you not forget, have
             | all implementing functions take a different type than the
             | type of the external input, which will remind people... Oh
             | right, I need to parse this thing first and in doing so
             | assert it's valid as well."
               | Iazel wrote:
               | Well said! I would only like to add that I highly
               | discourage adding validations/assertions in the actual
               | data class; this often makes them hard to work with and
               | reuse. It is better to have this parsing logic as a
               | simple function, perhaps at factory level if you prefer
               | that kind of flavor :)
             | mbildner wrote:
             | This is not yet possible in Typescript, but imagine if you
             | could define a numerical subtype that requires your input
             | be below some threshold eg:
             | `type Limit = 0..100;`
             | See discussion here:
             | https://github.com/Microsoft/TypeScript/issues/15480
       | twic wrote:
       | Great, but why do you need a library for this? I just write
       | classes with a fallible static parse method and a private
       | constructor.
       | It looks like this library was written by someone labouring under
       | the mistaken belief that it's better to build and use a DSL to
       | create the illusion of declarativity than to just write a line or
       | two of normal code (eg the focusedParse stuff).
       | Also, i demur somewhat at calling this parsing. It's tracking
       | validation using typestate.
       | skybrian wrote:
       | This library seems to be providing a framework and doesn't
       | include any interesting parsers. (There is no email address
       | parser, despite the example.) It seems to allow for some
       | composition of parsers, but the basic idea is a design pattern
       | that's simple enough that it doesn't obviously require a
       | framework.
       | So it seems like most of the value comes from standardizing on
       | domain types like Username, Email, and so on. Using a framework
       | doesn't get you there, and it adds a dependency on the framework.
         | Iazel wrote:
         | Hi skybrian, would you mind explaining why do you see this as a
         | framework?
         | About missing interesting parsers, you are right, for now only
         | the core part is done. Based on community interest, we will
         | work on complementary packages, like more common parsers, easy
         | integration with a web framework like ktor, effectful parsers
         | based on coroutine, etc...
         | Lots of work ahead :D
       | throwawayboise wrote:
       | I do as much of this as I can with database constraints. Foreign
       | key constraints, or check constraints, or even triggers if
       | necessary (though I do try to avoid them).
       | Databases tend to outlive application code, or may be fronted by
       | different applications (internal vs external for example).
       | Keeping the constraints with the data is the best way to ensure
       | that your data remains consistent within itself.
         | Iazel wrote:
         | I see, this is also an interesting approach and definitely has
         | its uses. Thinking about it, though, it has its own
         | limitations when it comes to scalability and to business
         | requirements that naturally fall outside the database, e.g.:
         | how would you ensure an S3 file reference is actually valid and
         | that the file does exist?
         | jhardy54 wrote:
         | I do this too, but I'm always frustrated by the mismatch
         | between database constraints and application constraints. For
         | example, when using Django you can declare a field as
         | varchar(32) but that constraint isn't checked until you
         | actually insert the row into the database. I suppose maybe
         | that's not a problem in languages with more mature type safety
         | ecosystems?
           | Iazel wrote:
           | Yeah, I've also worked with weak type systems in the past
           | (PHP, Ruby, JS), so I can definitely share the pain! I
           | learned the hard way how much easier it is to build complex
           | systems when you have a compiler helping you ;)
             | jhardy54 wrote:
             | What are you building with now? Rust/Go/something snazzy?
       (page generated 2021-05-15 23:00 UTC)