https://serokell.io/blog/parser-combinators-in-elixir

Parser Combinators in Elixir

Article by Gints Dreimanis
April 6th, 2021

Parser combinators are one of the most useful tools for parsing. In contrast to regular expressions, they are much more readable and maintainable, which makes them an excellent choice for more complex tasks.

This article has two parts. First, I'll explain how parser combinators work and what they are made of. After that, I'll guide you through making a CSV parser using NimbleParsec, a parser combinator library written in Elixir.

Introduction to parser combinators

In this part, I'll give a brief description of parser combinators, and we will try to build functional parser combinators from scratch. The combinators we'll make will be low-level and worse than what you would get with a simple regex; they are there to illustrate the point. If you want to see parser combinators in action, go straight to the NimbleParsec section.

What are parser combinators?

When programming, we frequently have to parse input (like a string) into a more computer-friendly data structure (like a tree or a list). One fast way to do that is to write a regular expression that captures everything we need. But such expressions can become quite lengthy and complicated, which leads to ugly code. What if we could instead write parsers that map one-to-one to the semantic units of the input and combine them into a parser for the whole input?

Ultimately, parser combinators are just this: a way to combine simple parsers into more complex parsers.

Parsers

So what do parsers do, exactly? The main goal of a parser is to turn a string of text into a different, more structured object, like a list or a tree. For example, we could accept a list of integers as a string "3, 1, 4, 1" and turn that string into a list that better represents the structure inherent in the string: [3, 1, 4, 1].

But what if we come upon a string like "3, 1, 4, 1 -- Monday, December 28th"? Or "oops, I'm sorry"? To compose with other parsers and handle possible failure, we also need to return the rest of the input if the parser succeeds and an error if it doesn't.

Here's an example of a low-level parser that parses one decimal digit in Elixir.

```elixir
def parse_digit(<<char, rest::binary>>) when char >= 48 and char <= 57,
  do: {:ok, [char - 48], rest}

def parse_digit(_), do: {:error, :wrong_input}
```
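To get a feel for these return values, here is roughly how the parser behaves in iex (assuming, as in the later snippets, that it lives in a module called SimpleParser):

```elixir
iex(1)> SimpleParser.parse_digit("42 apples")
{:ok, [4], "2 apples"}

iex(2)> SimpleParser.parse_digit("apples")
{:error, :wrong_input}
```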
If you ask what we can do with it, the answer is: not very much. To unlock the power of parser combinators, we need to find a way to put different parsers together.

Combinators

A parser combinator is a function that combines two or more parsers into another parser.

Let's think about the ways we could combine parsers. The most straightforward combination would be to link two together: make a parser that parses two decimal digits one after the other.

```elixir
def concat(fun, fun2) do
  fn x ->
    case fun.(x) do
      {:ok, parsed, rest} ->
        case fun2.(rest) do
          {:ok, parsed2, rest2} -> {:ok, parsed ++ parsed2, rest2}
          err -> err
        end

      err ->
        err
    end
  end
end
```

Here, the resulting parser applies the first function to the input, then the second function to the rest of the input returned by the first. We return both parsed items as one list, together with whatever input the second function didn't consume. If there is an error, it simply gets passed along.

Now we can use our combinator repeatedly to create a parser that can parse 2, 3, even 4, and more digits in a row! But that's just the beginning. There are multiple other combinator possibilities out there. A frequent one is choice, a naive version of which could look like this:

```elixir
def choice(fun1, fun2) do
  fn x ->
    case {fun1.(x), fun2.(x)} do
      {{:ok, parsed, rest}, _} -> {:ok, parsed, rest}
      {_, {:ok, parsed, rest}} -> {:ok, parsed, rest}
      {err, _} -> err
    end
  end
end
```

It tries two different parsers on the same input and picks the first one that succeeds, or returns an error. With these two simple combinators, we can make a parser that parses either two or three digits.

```elixir
def digit_parser() do
  fn x -> parse_digit(x) end
end

def two_digits() do
  digit_parser() |> concat(digit_parser())
end

def three_digits() do
  digit_parser() |> concat(digit_parser()) |> concat(digit_parser())
end

def two_or_three_digits() do
  choice(three_digits(), two_digits())
end
```

```elixir
iex(1)> SimpleParser.two_or_three_digits.("55")
{:ok, [5, 5], ""}

iex(2)> SimpleParser.two_or_three_digits.("5a")
{:error, :wrong_input}
```

By combining different parsers, you can build large, complicated parsers that represent the rules of languages like JSON or XML, for example. Real parser combinator libraries usually offer an assortment of combinators that make it possible to express parsers in a readable manner. We'll see that later in the NimbleParsec example.

Error handling in parser combinators

Our preliminary error handling is rather naive, and I've been informed that there is a misconception that parser combinators handle errors badly. Let's see how easily we can extend our parser to show the position of unexpected input.

First, let's change how parse_digit handles errors:

```elixir
def parse_digit(<<char, rest::binary>>) when char >= 48 and char <= 57,
  do: {:ok, [char - 48], rest}

def parse_digit(<<char, _rest::binary>>),
  do: {:error, {:unexpected_input, <<char>>, 1}}

def parse_digit(""), do: {:error, :end_of_string}
def parse_digit(_), do: {:error, :not_string}
```

In addition to unexpected-input errors, an end-of-string error can happen quite easily, so I made sure to cover that. Now we can modify our concat combinator to track the position of an input error, if one occurs:

```elixir
def concat(fun, fun2) do
  fn x ->
    case fun.(x) do
      {:ok, parsed, rest} ->
        case fun2.(rest) do
          {:ok, parsed2, rest2} ->
            {:ok, parsed ++ parsed2, rest2}

          {:error, {:unexpected_input, input, pos}} ->
            {:error,
             {:unexpected_input, input, String.length(x) - String.length(rest) + pos}}

          err ->
            err
        end

      err ->
        err
    end
  end
end
```

The choice combinator already handles these errors well. You can see the end result here. Now when we try two_or_three_digits.("5a"), we'll get {:error, {:unexpected_input, "a", 2}}. If we expose the code as a library, we can easily turn these tuples into nice error messages. This code is, of course, for demonstration purposes only, but a similar approach is used in megaparsec, a Haskell parser combinator library that is renowned for its decent error reporting.
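As a small sketch of what those nice error messages could look like, the structured tuples translate into text quite directly (format_error/1 is a hypothetical helper added for illustration, not part of the code above):

```elixir
# Hypothetical helper: turn the error tuples produced by the parsers above
# into human-readable messages.
def format_error({:error, {:unexpected_input, input, pos}}),
  do: "unexpected #{inspect(input)} at position #{pos}"

def format_error({:error, :end_of_string}),
  do: "unexpected end of input"

def format_error({:error, :not_string}),
  do: "input is not a string"

# SimpleParser.two_or_three_digits().("5a") |> SimpleParser.format_error()
# => "unexpected \"a\" at position 2"
```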
Where can you use parser combinators?

Since parser combinators are much more powerful than regular expressions, you can use them to parse input with a complex, recursive structure. But they can also be used for simple parsers where an item has a lot of different alternatives, for example. They do not replace regex, though; each tool has its benefits. I would use regex for simple scripts or one-liners and parser combinators for most other parsing needs.

Quick aside: I got a lot of help in this section from Jonn Mostovoy, who recently published a hands-on guide to using parser combinators in Rust. If you're interested in seeing how to handle them in a bare-metal language, I'd suggest checking it out.

Create your own CSV parser with NimbleParsec

NimbleParsec is a library that uses metaprogramming to provide you with efficient parsers that compile to binary pattern matching. In this section, we'll use it to build a simple CSV parser that takes a CSV file and converts it into a list of lists.

Setup

First off, let's create a new project with mix new csv_parser --module CSVParser. Afterward, add {:nimble_parsec, "~> 1.0"} to the list of dependencies in mix.exs and import NimbleParsec in the module.

```elixir
defmodule CSVParser do
  import NimbleParsec
end
```

Grammar

Now we have to think about the structure of a CSV file. CSV consists of lines, each of which consists of values separated by commas. We can define a CSV value first and then use that definition to define a line. Let's write out simple definitions in English.

* Value is a string (let's ignore numbers, escape characters, and floats for now).
* Line consists of a value, then zero or more repeats of (comma, then value), then an EOL character.

How can we use the functions available in the library to reflect this simple grammar?

Building blocks

Let's try to build individual parsers for value and line.

To implement value, we need to think about the characters that separate values. The obvious separator is the comma, but a value can also end at the newline characters \n and \r. Values can also be empty, so we need to allow for that. The best fit for our goal is utf8_string, which lets us provide several arguments such as not (which characters not to parse) and min (minimum length).

```elixir
value = utf8_string([not: ?\r, not: ?\n, not: ?,], min: 0)
```

Then we need to define a line. For us, a line is a value, then a comma and a value repeated 0 or more times, and then an EOL character. We have value defined, but let's quickly define an EOL parser that covers Windows, macOS, and Linux.

```elixir
eol =
  choice([
    string("\r\n"),
    string("\n")
  ])
```

As we saw before, choice picks the first option that succeeds from a list of parsers. After that, we can use the combinators ignore, concat, and repeat together with our defined parsers to define a line.

```elixir
line =
  value
  |> repeat(ignore(string(",")) |> concat(value))
  |> ignore(eol)
```

ignore consumes the matched input and moves forward without adding anything to the result, concat composes two parsers, and repeat repeats a parser until it no longer succeeds.

Putting it together

Now that we have the line element, it is very easy to define the full parser. To do that, we need to use the defparsec macro.

```elixir
defparsec :file, line |> wrap() |> repeat(), debug: true
```

Here, we parse a line, wrap its results in a list, and repeat the process until it no longer succeeds. Now, if we read a CSV file, CSVParser.file(file_contents) will parse the contents of simple CSV files.
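For instance, here is roughly what a call on a small, hand-made input should give us (the sample string is my own; NimbleParsec results also carry the unconsumed rest of the input, a context map, and position information):

```elixir
# A tiny CSV string assembled by hand for illustration.
csv = "name,points\nalice,3\nbob,5\n"

# The generated file parser returns an :ok tuple with the parsed rows first.
{:ok, rows, "", _context, _line, _offset} = CSVParser.file(csv)

rows
#=> [["name", "points"], ["alice", "3"], ["bob", "5"]]
```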
Here's all the code:

```elixir
defmodule CSVParser do
  import NimbleParsec

  value = utf8_string([not: ?\r, not: ?\n, not: ?,], min: 0)

  eol =
    choice([
      string("\r\n"),
      string("\n")
    ])

  line =
    value
    |> repeat(ignore(string(",")) |> concat(value))
    |> ignore(eol)

  defparsec :file, line |> wrap() |> repeat(), debug: true
end
```

Hooray, a working CSV parser! Or is it?

Preparing for escape

Our CSV definition was very simple. It doesn't cover some of the things that can appear in CSV files, and it also treats numbers and strings the same. Arguably, we could have split the file on newline characters, mapped a split on commas over the resulting list, and achieved the same result. But since we have created a good foundation, adding new definitions to the parser is much simpler than improving a two-line function. Let's try to fix one of these issues now.

One of the problems is that CSV files can have commas inside entries, while our parser always splits on the comma. Let's add an option to escape commas by wrapping the entry in double quotes. To do that, we need to extend our value definition.

An escaped value consists of zero or more characters surrounded by double quotes. If there is a double quote inside the escaped value, it needs to be escaped by another double quote. In other words, these are all valid options:

```
text
"text"
"text, text"
"text, ""text"""
```

First of all, our item is surrounded by double quotes:

```elixir
escaped_value = ignore(string("\"")) ??? ignore(string("\""))
```

Then we need to figure out how to parse the inside of the double quotes to fulfill the requirements. After going through a rather roundabout way to achieve this (you don't want to know), I found a tip in Real World Haskell: we can just read the characters of the item one by one, matching either two double quotes in a row or a single non-quote character. This parses one character:

```elixir
escaped_character =
  choice([
    string("\"\""),
    utf8_string([not: ?"], 1)
  ])
```

Now we can use the repeat combinator on escaped_character, then join all the characters we parsed.

```elixir
escaped_value =
  ignore(string("\""))
  |> repeat(escaped_character)
  |> ignore(string("\""))
  |> reduce({Enum, :join, [""]})
```

Let's rename the original value to regular_value and make value a choice between escaped_value and regular_value.

```elixir
escaped_character =
  choice([
    string("\"\""),
    utf8_string([not: ?"], 1)
  ])

regular_value = utf8_string([not: ?\r, not: ?\n, not: ?,], min: 0)

escaped_value =
  ignore(string("\""))
  |> repeat(escaped_character)
  |> ignore(string("\""))
  |> reduce({Enum, :join, [""]})

value =
  choice([
    escaped_value,
    regular_value
  ])
```

We need to put escaped_value first because otherwise the parser will succeed with regular_value on our string before we ever get a chance to escape. You can see the full parser code here.
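For reference, assembling the snippets above into one module gives roughly the following (a sketch that mirrors the code shown in this article; the linked source may differ in details):

```elixir
defmodule CSVParser do
  import NimbleParsec

  # Inside quotes: two double quotes in a row stand for a literal quote,
  # anything else except a quote counts as a single character.
  escaped_character =
    choice([
      string("\"\""),
      utf8_string([not: ?"], 1)
    ])

  regular_value = utf8_string([not: ?\r, not: ?\n, not: ?,], min: 0)

  escaped_value =
    ignore(string("\""))
    |> repeat(escaped_character)
    |> ignore(string("\""))
    |> reduce({Enum, :join, [""]})

  # escaped_value must come first, otherwise regular_value succeeds
  # before we get a chance to handle the quotes.
  value = choice([escaped_value, regular_value])

  eol =
    choice([
      string("\r\n"),
      string("\n")
    ])

  line =
    value
    |> repeat(ignore(string(",")) |> concat(value))
    |> ignore(eol)

  defparsec :file, line |> wrap() |> repeat(), debug: true
end
```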
Further improvements

This parser can be further improved, of course. For example, you can add support for extra whitespace or numbers, which is an exciting exercise to do on your own.

I hope that this has been an exciting journey and that you learned something new today! If you want to read more about Elixir, you are welcome to browse our Elixir articles and follow us on Twitter, DEV or Medium to receive updates whenever we publish new ones.