[HN Gopher] Parsix: Parse Don't Validate
___________________________________________________________________
Author : Iazel
Score  : 105 points
Date   : 2021-05-15 15:38 UTC (7 hours ago)

(HTM) web link (github.com)

| ledauphin wrote:
| I've been looking for a solid Typescript implementation of "parse
| don't validate" that performs runtime parsing using semantics
| attached to the defined Typescript types themselves. In other
| words, much like attrs for Python, I want to be able to define a
| low/no-boilerplate type, and then register parsers for those
| types that will work recursively to parse my data, resulting in
| the specified Typescript type.
|
| Has anyone seen or written something like this?

| iddan wrote:
| It is definitely possible as Flowtype got it right. I hope one
| day it will come to TypeScript as well

| brundolf wrote:
| Use io-ts: https://github.com/gcanti/io-ts
|
| You define a decoder schema, and then the resulting TS type
| gets automatically derived for you. You can then run data
| through the decoder, it will err if there's a mismatch, or
| return a value of the inferred type otherwise.

| bmuon wrote:
| I've been using this small library inspired by Elm/Swift
| decoders [1]. It works, but it's not low boilerplate.
|
| I'm gravitating towards GraphQL now because strict parsing is
| built into it, so there is no need for all this boilerplate.
|
| [1] https://www.npmjs.com/package/@mojotech/json-type-validation

| ledauphin wrote:
| we use GraphQL for this purpose as well, but I'd also like to
| be able to validate across other boundaries.
|
| However, as I'm saying this, I wonder if I've been looking at
| this problem wrong. Since we already generate types from
| GraphQL schemas, maybe I should figure out how to use the
| same client side parser that's already in my GraphQL client,
| define a GraphQL schema for the types I'm interested in, and
| then just generate and use those types.
|
| One thing that doesn't necessarily give me is the ability to
| define custom parsers corresponding to custom types. At
| least, I think most of that sort of thing is usually done
| server side with GraphQL.
|
| So, thank you for the link and also the inspiration for
| considering an alternative.

| renke1 wrote:
| Not exactly what you want, I think, but there is zod [0].
|
| I really would like to see nominal typing support in
| TypeScript. Currently, it's hard to validate a piece of data
| (or parse for that matter) once and have other functions only
| operate on that validated data. There are (ugly?) workarounds
| though [1].
|
| [0]: https://github.com/colinhacks/zod
| [1]: https://gist.github.com/dcolthorp/aa21cf87d847ae9942106435bf...

| brundolf wrote:
| Similar thing for TypeScript: https://github.com/gcanti/io-ts

| billytetrud wrote:
| To me this just looks like they're arguing for using class types
| rather than raw strings. The parsing seems kind of orthogonal and
| a special case of the kinds of validation you might want to do.
|
| It's also misleading in that the code is still doing validation,
| just in a different place.

| Iazel wrote:
| Yes, it is basically a combination of proving some data has
| been validated by encoding this proof in a specific type, like
| Email :) We want to popularize this idea and make it easier to
| work with it by offering some nice, type-safe abstraction.

| didibus wrote:
| Yeah, but I think it's even more so, they're arguing that you
| should model the fact that something has been validated or not,
| and functions should indicate if they expect a validated form
| of input or not.
|
| In that sense, using types is only one way to do this, but you
| could model that in other ways. For example:
|
|     var foo = "123"
|     foo = validFoo(foo)
|     print(foo)
|     > {"value" : "123", "valid?" : true}
|
| And now you could have:
|
|     function bar(validFoo) {
|       if (!validFoo.get("valid?"))
|         throw new InvalidInputException("Foo must be validated prior to calling bar.")
|       ...
|     }
|
| Now types are a convenient way to do this that also gives you
| static checking for it, but I believe the idea is more to model
| that things were validated, and to expect validated input or fail.
|
| That allows you to push all validation to the boundary, and
| make sure that no one ever forgets to validate the input,
| because if they do, the inner functions will fail reminding the
| caller: Please remember to validate this!

| billytetrud wrote:
| Makes sense. It just seems like parsing is kind of a separate
| issue and shouldn't be entangled with the concept of input
| validation.

| alserio wrote:
| I mean, yes, the point is that it is a better place to do the
| validation step. Also, "parse" is used generically here to mean
| going from one representation to a more structured one.

| samatman wrote:
| As a minor point of order, the exact phrase "parse, don't
| validate" has been conventional wisdom in langsec circles since I
| got involved, so 2014 at the earliest.
|
| I asked around on the work Matrix as to who actually coined it,
| but it's the weekend.
|
| This is not to take anything away from @lexi_lambda, who cited
| her sources and documented an interesting type-theoretic approach
| to applying the principle. She did a great job!
|
| If anyone wants to do a deeper dive, look into langsec,
| language-theoretic security. There's a lot of prior art to explore.

| alex_duf wrote:
| I think this can be summarised by "model your domain by using
| types, then let the compiler ensure you're not doing anything
| silly"

| Waterluvian wrote:
| I developed a pattern in typescript (I'm sure it's not original)
| where I have an interface describing an API entity and a class of
| the same name with only static methods, one of which is
| Foo.fromApi() that validates and parses.
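
A minimal TypeScript sketch of the pattern described just above: an interface for the API entity plus a same-named class that holds only static helpers. Only the name `Foo.fromApi` comes from the comment; the entity shape (`id`, `name`) and the validation rules are assumptions made up for illustration.

```typescript
// Hypothetical API entity. The fields are assumptions for illustration.
interface Foo {
  readonly id: number;
  readonly name: string;
}

// Same-named class with only static methods. TypeScript merges the
// interface into the class's instance type, so `Foo` names both the
// data shape and the place where parsing lives.
class Foo {
  // Validates untrusted input and returns it as a typed Foo, or throws.
  static fromApi(raw: unknown): Foo {
    if (typeof raw !== "object" || raw === null) {
      throw new TypeError("Foo: expected an object");
    }
    const { id, name } = raw as Record<string, unknown>;
    if (typeof id !== "number" || !Number.isInteger(id)) {
      throw new TypeError("Foo.id: expected an integer");
    }
    if (typeof name !== "string" || name.length === 0) {
      throw new TypeError("Foo.name: expected a non-empty string");
    }
    return { id, name };
  }
}
```

Code past the boundary can then take a `Foo` parameter and never look at raw JSON; everything that arrives as `unknown` has to go through `Foo.fromApi` first.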
|
| I haven't seen any need to bring a library in to handle this.
| Though it would be nice to marry the worlds of TS, API, and JSON
| Schema.

| lhnz wrote:
| It doesn't use json schema but you might be interested in
| something like this: https://gcanti.github.io/io-ts/
|
| (You can define runtime encoder/decoders which produce typed
| values.)

| brundolf wrote:
| io-ts is fantastic (I linked it myself above). The killer
| feature is that it infers the static types of your runtime
| schemas for you, so you don't have to define them twice. You
| make a change to the schema, the rest of your code will
| typecheck against it.

| smnrchrds wrote:
| Fun fact: Parsix was the name of a Linux distro optimized for
| Persian speakers.

| didibus wrote:
| I have to be honest, I'm not seeing what problem this is trying
| to solve. Can anyone enlighten me?
|
| Edit: Ok I think I understand...
|
| It seems the problem arises when you're implementing a
| function that takes a user email as a string, and that function
| is in a lower layer of the application, like inside the data
| access layer. It is difficult at this point to know if the email
| string you will be passed as input has already been validated or
| not. Thus you might be tempted to re-implement validation for it
| at your level inside this function as well and have an
| assertValidEmail check.
|
| This can lead to a littering of validation throughout the code
| base, as each implemented function worries that the input isn't
| validated and re-validates it, possibly using slightly different
| rules each time.
|
| Furthermore, if you decide to not validate it again, you might be
| left wondering, but am I sure it'll have been validated prior?
| How can I be sure? Someone in the future could easily start
| calling my function and forget to validate the email before
| calling it? This could eventually lead to a security issue or
| just a bug, by introducing a code path that doesn't ever validate
| the email string.
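
The hazard described above can be sketched in TypeScript. `assertValidEmail` is the name used in the comment; `saveEmail`, `carefulCaller`, `forgetfulCaller`, and the deliberately naive email rule are hypothetical and exist only for illustration. The point is that both callers type-check, so nothing stops the second one from shipping.

```typescript
// "Validate" style: the check returns nothing, so no evidence that it
// ran survives in the types. The rule here is intentionally simplistic.
function assertValidEmail(candidate: string): void {
  if (!/^[^@\s]+@[^@\s]+$/.test(candidate)) {
    throw new Error(`not an email address: ${candidate}`);
  }
}

// Stand-in for the data access layer.
const savedEmails: string[] = [];

// Lower-layer function. Its signature cannot say whether the string
// it receives has been checked or not.
function saveEmail(email: string): void {
  savedEmails.push(email);
}

// Remembers to validate before calling the lower layer.
function carefulCaller(input: string): void {
  assertValidEmail(input);
  saveEmail(input);
}

// Forgets to validate. This compiles without any complaint.
function forgetfulCaller(input: string): void {
  saveEmail(input);
}
```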
|
| Thus if instead you'd re-write your function so it takes an email
| as a ValidEmail type (or object), and not as a string, you force
| the caller to for sure remember to validate the email first. And
| you also can safely assume if you're getting an email as a
| ValidEmail type that it has been validated. It could also
| technically allow you to localize the validation logic to the
| ValidEmail type constructor, avoiding possible duplicate attempts
| at validating email with different rules.
|
| And it seems the latter "style" the author calls "Parsing" while
| the former they call "Validating", in the sense that since the
| function validating returns a modified structure it "parsed" it,
| because a string became a ValidEmail, thus parsing a string into
| a ValidEmail, as opposed to simply validating that the string is
| valid as an email.
|
| And finally, this is a little library to help make use of this
| pattern in Kotlin.

| jinwoo68 wrote:
| As they said in README, it's inspired by Alex King's Parse,
| don't validate [1].
|
| Basically, rather than write a validation function, write a
| parser that returns a result of a specific type and use that
| type everywhere else. Then you can make sure the raw inputs are
| always validated.
|
| [1] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

| anaerobicover wrote:
| Small correction: the author is named Alex_is_

| jinwoo68 wrote:
| Whoops. Thanks for correcting.

| steventhedev wrote:
| There's an entire class of vulnerabilities caused by having
| separate verification and parsing logic, typically involving
| fields where usually only one occurrence is used but the format
| supports multiple. The verifier checks the first one but the
| parser uses the last one.

| Smaug123 wrote:
| If you mean "why Parse, Don't Validate", you should read the
| original blog post, linked at the top of the article. It's...
| transformative, if you aren't already aware of the principle.
| https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
|
| If you mean "why this library", well, I guess parser
| combinators are nice! Some may say that a declarative statement
| of the parsing restrictions is better than a procedural
| implementation, on general principles.

| finnh wrote:
| It encourages people to use strongly typed classes rather than
| primitives, even if the type simply wraps a primitive.
|
| As a result you can't pass an invalid (say) accountID deep into
| your code, because validity is guaranteed to be checked early
| when you "parse" an input string into the "AccountId" type.
|
| So: internal interfaces defined using non-primitive types, so
| internal methods don't need to keep validating their input.
| Conversion to said types happens early and predictably,
| catching bad values before they (eg) hit the database.

| coderintherye wrote:
| The linked blog post explains it pretty well. Essentially, it
| seems to be solving for unexpected cases or incorrect
| validation by using static typing and passing the expected
| type back in the return rather than a boolean. I'm not sure
| I've encountered enough issues with validation functions to use
| this pattern, but it does seem like a more robust way of
| writing them.

| jpeloquin wrote:
| Paraphrased from the repo's readme: Suppose you have a program
| that consumes user input. Users often give bad input so the
| program needs to validate user input before acting on it. One
| way to validate is to call a function (e.g., `check_input`) on
| the user input and if it doesn't raise an error the input is
| safe for consumption by the rest of the program. The repo
| author considers this approach to be risky because the
| programmer can inadvertently omit or bypass `check_input` and
| the program still compiles and runs without complaint.
|
| The repo presents an alternative validation approach, which is
| to parse the user input into a data type (or, not quite
| equivalently, into a class).
| The parsing process serves as
| validation. Consumer functions are written such that they only
| accept the parsed data type. Therefore it is now impossible for
| the programmer to inadvertently omit or bypass validation of
| user input.
|
| The library is a set of convenience functions for actually
| writing these parsing / validation functions.

| _jal wrote:
| Reminds me a little of taint checking in Perl and Ruby, in
| reverse.

| atoav wrote:
| So in short: instead of representing user input (e.g. an Email
| address) as a string - which you can forget to validate - the
| idea here is to create its own data type for it, and use the
| validation step to create said data type.
|
| The rest of your program then works with this data type
| instead of the string and this way you will get a type error
| whenever you accidentally use unvalidated data.
|
| A nice idea that goes into a similar direction is to expand
| on this and create more types for different levels of trust.
| E.g. you could have the data types ValidatedEmail,
| VerifiedEmail and TrustedEmail and define precisely how one
| becomes the other. This way your typesystem will already tell
| you what is valid and what is not and you can't accidentally
| mix them up.

| TeMPOraL wrote:
| You can also further generalize this idea by noticing you
| can encode all kinds of life cycle information in your type
| system. As you transform some data in a sequence of steps,
| you can use types to document and enforce the steps are
| always executed in order.
|
| In this example, the user input validation step is
| f(String) -> ValidatedEmail, then the process of verifying
| it is f(ValidatedEmail) -> VerifiedEmail. But the same
| principle can apply to e.g. append() operation being
| f(List[T], T) -> NonEmptyList[T], and you can write code
| accepting NonEmptyList to save yourself an emptiness check.
| Or, take a multi-step algorithm that gets a list of users,
| filters them by some criterion, sorts the list, and sends
| these users e-mails. Type-wise, it's a flow of Users ->
| EligibleUsers -> SortedEligibleUsers ->
| ContactedEligibleUsers.
|
| And then, why should types be singular anyway? You should
| be able to tag data with properties, and then filter on or
| transform a subtag of them. This is the area of theory I'm
| not familiar with yet, but I imagine you _should_ be able
| to do things like:
|
| List[User] -> List[User, NonEmpty] -> List[User[Eligible],
| NonEmpty] -> List[User[Eligible], NonEmpty, Sorted[Asc]] ->
| List[User[Contacted], Sorted[Asc]].
|
| Or,
|
| Email -> Email[Validated] -> Email[Validated, Verified] ->
| Email[Validated, Verified, Trusted].
|
| I'm sure there's a programming language that does that, and
| then there's probably lots of reasons that this doesn't
| work in practice. I'd love to know about them, as I haven't
| encountered anything like it in practice, except bits and
| pieces of compiler code that can sometimes propagate such
| information "in the background", for optimization and
| correctness checking.

| _greim_ wrote:
| To keep building on this, I think the word "parsing" is just
| the tip of the iceberg. Parsing is one way to port data
| across a type boundary, where the source and dest types are
| optimized for different use cases (e.g. serialization vs
| type-safe representation). Since the semantic Venn diagrams
| of any two types might have areas of non-overlap,
| parse-don't-validate means establishing clear boundaries in your
| program where those translations happen, then defining the
| types on either side of the boundary to rule out the
| possibility of nonsense states elsewhere throughout the
| program. The idea of nonsense states is closely related and
| discussed more here[0] and here[1].
|
| [0] http://blog.jenkster.com/2016/06/how-elm-slays-a-ui-antipatt...
|
| [1] https://kentcdodds.com/blog/make-impossible-states-impossibl...

| StreamBright wrote:
| Interesting naming. Strongly typed languages (especially in the
| ML family) have best practices that include using types instead
| of strings as function parameters. The Email type itself is
| enough to skip validation in each function accepting that
| particular type.
|
| I think this is a great first step using functional languages but
| you can go much much deeper than that.
|
| https://www.slideshare.net/ScottWlaschin/the-power-of-compos...

| cle wrote:
| There are lots of siblings explaining why "parse don't
| validate".
|
| But also, it's not always wise to take this to an extreme. I've
| seen over the years many scenarios where dev teams were
| over-enthusiastic about this and parsed themselves into a corner by
| making system components over-strict and enforcing invariants
| that weren't necessary to enforce, making them much harder to
| change later.
|
| The right answer is, of course, somewhere in the middle, and
| depends on your domain and situation.

| Iazel wrote:
| hi, @cle! Curious to hear more about that, were they actually
| running validation/assertions in constructors?

| cle wrote:
| That can be a case of that yeah. Using your example, a lot
| of devs might use that email parsing logic in various
| independent components of the same system. Eg if you have a
| reporting component that sends you business reports, that
| component really shouldn't be validating the structure of
| email addresses...if you need to refine the parsing logic
| now you've got to do coordinated deployments, possibly
| backfills, etc., whereas if you just treated it as an
| opaque string in that system you'd be better off.
|
| This isn't really a criticism of the approach, it's super
| useful, just that it needs to be applied judiciously.
| "Parse all the things" isn't always the best advice.

| Iazel wrote:
| Cool to see you perfectly got the point in the end!
| I wonder
| though, were you confused by the README? What made it clear for
| you?

| didibus wrote:
| Hum, it was the people here who replied to my question, and
| also reading the linked article.
|
| I think my confusion was in trying to frame things as parsing
| VS validating. While I now appreciate that use of the word, now
| that I understand, it was also my biggest source of
| confusion.
|
| That's because I think most people think of parsing as
| conversion, like I turn a String to an Int. Whereas in your
| case, you're simply wanting to tag a type as having been
| validated, but you don't really convert the type itself, so
| you simply wrap it in another type in order to tag it as
| having been validated simply because the language offers no
| other way to tag the type with meta-information for the
| compiler to assert statically.
|
| So because it seemed more like you're just wrapping the
| input, but still all code will be using the input value as it
| is, extracting it out of your wrapped type, the idea that you
| were "Parsing" and not "Validating" well just confused me.

| mirekrusin wrote:
| Imagine you're writing a typescript project. You type everything
| and have type safety. This type safety is an illusion at I/O
| boundaries, i.e. whenever JSON.parse(...) runs on data from a
| file/websocket/http. To preserve type safety, you want
| to use something like [0] to do runtime type assertions. Once
| i/o boundaries are parsing unknown types at runtime into what
| is defined as static types, your type safety is guaranteed.
|
| [0] https://github.com/appliedblockchain/assert-combinators

| rdedev wrote:
| I find this approach combined with phantom data types really
| cool. Now you can easily introduce a semantic differentiation
| between two instances of the same data type but without much
| overhead

| GordonS wrote:
| If it helps, here's a related blog post but with a C# slant:
|
| https://andrewlock.net/using-strongly-typed-entity-ids-to-av...
|
| The author refers to using primitives everywhere as "primitive
| obsession", and proposes using types instead.

| dmux wrote:
| Similar to the idea of "microtypes" (I've most often seen it
| used in Java circles):
|
| https://www.markhneedham.com/blog/2009/03/10/oo-micro-types/

| matheusmoreira wrote:
| This also has security implications. The input handling layer
| is critical. Bugs in parsing and validation code are
| responsible for a huge number of vulnerabilities.
|
| More details: http://langsec.org/
|
| > The Language-theoretic approach (LANGSEC) regards the
| Internet insecurity epidemic as a consequence of _ad hoc_
| programming of input handling at all layers of network stacks,
| and in other kinds of software stacks.
|
| > LANGSEC posits that the only path to trustworthy software
| that takes untrusted inputs is treating all valid or expected
| inputs as a formal language, and the respective input-handling
| routines as a _recognizer_ for that language.

| TheAceOfHearts wrote:
| Refining types so they encode all desired constraints before
| use. This is explained in the linked article: Parse, don't
| validate [0].
|
| It helps reduce the risk of using invalid inputs by
| representing constraints over the value as part of the type.
|
| For example: a common problem in web development security is
| that query parameters aren't properly validated which can lead
| to denial of service attacks. As a trivial example of this,
| consider a web server which paginates some data using "offset"
| and "limit" by passing those parameters directly to a database
| query; an attacker could set "limit" to some incredibly high
| value and cause the server to crash. If you're just doing
| validation on your inputs it's possible that some usage could
| end up being overlooked.
|
| [0] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
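
The pagination example above could be sketched as follows, assuming TypeScript and a "branded" number type. `Limit`, `parseLimit`, `buildPageQuery`, the bound of 100, and the `items` table are all hypothetical names chosen for illustration; none of them come from Parsix or any other library.

```typescript
// Branded type: a number at runtime, but a plain number is not
// assignable to it, so the compiler can tell the two apart.
type Limit = number & { readonly __brand: "Limit" };

// Assumed safe upper bound; pick whatever the database can tolerate.
const MAX_LIMIT = 100;

// The only place a Limit is created. Takes the raw query-string value
// and throws if it is missing, not a whole number, or out of range.
function parseLimit(raw: string | undefined): Limit {
  if (raw === undefined || !/^\d+$/.test(raw)) {
    throw new RangeError("limit must be a whole number");
  }
  const n = Number(raw);
  if (n < 1 || n > MAX_LIMIT) {
    throw new RangeError(`limit must be between 1 and ${MAX_LIMIT}`);
  }
  return n as Limit;
}

// Lower layer: accepts only a parsed Limit, so it needs no range check.
function buildPageQuery(limit: Limit): string {
  return `SELECT * FROM items LIMIT ${limit}`;
}

// buildPageQuery(1000000); // would not compile: number is not a Limit
```

With this shape, a handler that passes the raw query parameter straight to `buildPageQuery` fails to type-check, which is the reminder to call `parseLimit` at the boundary.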

| gregors wrote:
| So real question - in the "offset" "limit" example what makes
| it any more safe if at first the programmer sets those types
| to be integers? The same problem persists does it not?
|
| Does the explicit creation of a type add this introspection?
| I'm not convinced that it does. Now once you fix this bug,
| encoding it in a type prevents it from creeping into other
| parts of the code. This seems more like DRY principles in
| action.

| TheAceOfHearts wrote:
| Apologies if I did a poor job of explaining, what you wrote
| seems in agreement with what I was attempting to convey.
|
| If one were only using integer types then the same problem
| would persist, that's correct. The problem would be solved
| by defining our limit type to only represent positive
| integers up to a specific safe value.
|
| Type refinement is done on the input boundaries of the
| system during runtime to prevent errors from propagating.

| didibus wrote:
| Yeah, it seems to be more about guarantees as a code base
| grows larger and more people touch it.
|
| If there's a Limit class whose constructor and setter all
| check that the range is between say 5 to 100, and all
| existing code that needs the limit uses the instance of
| Limit, it just becomes less likely a code change is made
| that uses the limit input as it was directly provided by
| the user (and thus possibly out of range).
|
| But you'd still need to have had someone be smart enough to
| make sure the Limit class does prevent limits that could
| cause DB crashes.
|
| In practice I'm thinking, ok, so someone must have
| thought... Hey we should validate this user input and put
| in some logic for it.
|
| So I think what this says is, validation works by having
| all external input validated as they are received. But it
| can be easy to make a code change at the boundary where you
| forget to add proper validation.
| If all existing functions
| in the lower layers, like in the data access layer, are
| designed to take a Limit object, the person who took a
| limit as external input and was about to pass it to the
| query function will get a compile error and realize... Oh I
| need to first parse my integer limit into a Limit, and thus
| reminds them to use the thing that enforces the valid
| range.
|
| If instead the code had a util function called
| assertValidLimit, and the query function took a limit as an
| integer, it'd be easy for that person to forget to add a call
| to assertValidLimit when getting the limit from the user
| and then pass that unvalidated to the query and possibly
| cause a vulnerability.
|
| And lastly, it seems they argue that you could instead validate
| in the query function itself, so it wouldn't matter if others
| forget to validate, since the place where it matters would;
| but then it is hard to fail at that layer, since you
| might have already made other changes, and that can leave
| your state corrupted.
|
| So basically it seems the argument is:
|
| "It is best to validate external input at the boundary as
| soon as it is received, but it can be easy to forget to do
| so and that's dangerous. So to help you not forget, have
| all implementing functions take a different type than the
| type of the external input, which will remind people... Oh
| right I need to parse this thing first and in doing so
| assert it's valid as well."

| Iazel wrote:
| Well said! I would only like to add that I highly
| discourage adding validations/assertions in the actual
| data class, this often makes them hard to work with and
| reuse.
| It is better to have this parsing logic as a
| simple function, perhaps at factory level if you prefer
| that kind of flavor :)

| mbildner wrote:
| This is not yet possible in Typescript, but imagine if you
| could define a numerical subtype that requires your input
| be below some threshold eg:
|
| `type Limit = 0..100;`
|
| See discussion here:
| https://github.com/Microsoft/TypeScript/issues/15480

| twic wrote:
| Great, but why do you need a library for this? I just write
| classes with a fallible static parse method and a private
| constructor.
|
| It looks like this library was written by someone labouring under
| the mistaken belief that it's better to build and use a DSL to
| create the illusion of declarativity than to just write a line or
| two of normal code (eg the focusedParse stuff).
|
| Also, I demur somewhat at calling this parsing. It's tracking
| validation using typestate.

| skybrian wrote:
| This library seems to be providing a framework and doesn't
| include any interesting parsers. (There is no email address
| parser, despite the example.) It seems to allow for some
| composition of parsers, but the basic idea is a design pattern
| that's simple enough that it doesn't obviously require a
| framework.
|
| So it seems like most of the value comes from standardizing on
| domain types like Username, Email, and so on. Using a framework
| doesn't get you there, and it adds a dependency on the framework.

| Iazel wrote:
| Hi skybrian, would you mind explaining why you see this as a
| framework?
|
| About missing interesting parsers, you are right, for now only
| the core part is done. Based on community interest, we will
| work on complementary packages, like more common parsers, easy
| integration with a web framework like ktor, effectful parsers
| based on coroutine, etc...
|
| Lots of work ahead :D

| throwawayboise wrote:
| I do as much of this as I can with database constraints.
| Foreign
| key constraints, or check constraints, or even triggers if
| necessary (though I do try to avoid them).
|
| Databases tend to outlive application code, or may be fronted by
| different applications (internal vs external for example).
| Keeping the constraints with the data is the best way to ensure
| that your data remains consistent within itself.

| Iazel wrote:
| I see, this is also an interesting approach and definitely has
| its uses. Thinking about it, though, it has its own
| limitations when it comes to scalability and to business
| requirements that naturally fall outside the database, eg: how
| would you ensure an S3 file reference is actually valid and
| that it does exist?

| jhardy54 wrote:
| I do this too, but I'm always frustrated by the mismatch
| between database constraints and application constraints. For
| example, when using Django you can declare a field as
| varchar(32) but that constraint isn't checked until you
| actually insert the row into the database. I suppose maybe
| that's not a problem in languages with more mature type safety
| ecosystems?

| Iazel wrote:
| Yeah, I've also worked with weak type systems in the past
| (PHP, Ruby, JS), so I can definitely share the pain! I
| learned the hard way how much easier it is to build complex
| systems when you have a compiler helping you ;)

| jhardy54 wrote:
| What are you building with now? Rust/Go/something snazzy?
___________________________________________________________________
(page generated 2021-05-15 23:00 UTC)