The task of parsing a file, or data of various types, is a common one for programmers. We already learned about Haskell's support for regular expressions back in the section called “Regular expressions in Haskell”. Regular expressions are nice for many tasks, but they rapidly become unwieldy, or cannot be used at all, when dealing with a complex data format. For instance, we cannot use regular expressions to parse source code from most programming languages.
Parsec is a useful parser combinator library, with which we combine small parsing functions to build more sophisticated parsers. Parsec provides some simple parsing functions, as well as functions to tie them all together. It should come as no surprise that this parser library for Haskell is built around the notion of functions.
It's helpful to know where Parsec fits compared to the tools used for parsing in other languages. Parsing is sometimes divided into two stages: lexical analysis (the domain of tools like flex) and parsing itself (performed by programs such as bison). Parsec can perform both lexical analysis and parsing.
Let's jump right in by writing some code for parsing a CSV file. CSV files are often used as a plain text representation of spreadsheets or databases. Each line is a record, and each field in the record is separated from the next by a comma. There are ways of dealing with fields that contain commas, but to start with, we won't worry about it.
This first example is much longer than it really needs to be. We will introduce more Parsec features in a little bit that will shrink the parser down to only four lines!
-- file: ch16/csv1.hs
import Text.ParserCombinators.Parsec
{- A CSV file contains 0 or more lines, each of which is terminated
   by the end-of-line character (eol). -}
csvFile :: GenParser Char st [[String]]
csvFile = 
    do result <- many line
       eof
       return result
-- Each line contains 1 or more cells, separated by a comma
line :: GenParser Char st [String]
line = 
    do result <- cells
       eol                       -- end of line
       return result
       
-- Build up a list of cells.  Try to parse the first cell, then figure out 
-- what ends the cell.
cells :: GenParser Char st [String]
cells = 
    do first <- cellContent
       next <- remainingCells
       return (first : next)
-- The cell either ends with a comma, indicating that 1 or more cells follow,
-- or it doesn't, indicating that we're at the end of the cells for this line
remainingCells :: GenParser Char st [String]
remainingCells =
    (char ',' >> cells)            -- Found comma?  More cells coming
    <|> (return [])                -- No comma?  Return [], no more cells
-- Each cell contains 0 or more characters, which must not be a comma or
-- EOL
cellContent :: GenParser Char st String
cellContent = 
    many (noneOf ",\n")
       
-- The end of line character is \n
eol :: GenParser Char st Char
eol = char '\n'
parseCSV :: String -> Either ParseError [[String]]
parseCSV input = parse csvFile "(unknown)" input
Let's take a look at the code for this example. We didn't use many shortcuts here, so remember that this will get shorter and simpler!
      We've built it from the
      top down,
      so our first function is csvFile.  The type of this
      function is GenParser Char st [[String]].  This
      means that the type of the input is a sequence of characters, which is
      exactly what a Haskell string is, since String is
      the same as [Char].  It also means that we will
      return a value of type [[String]]: a list of a list
      of strings.  The st can be ignored for now.
    
      Parsec programmers often omit type declarations, since we write so
      many small functions.  Haskell's type inference can figure it out.
      We've listed the types for the first example here so you can get a
      better idea of what's going on.  You can always use
      :t in ghci to inspect types as well.
    
      The csvFile uses a do block.  As this
      implies, Parsec is a monadic library: it defines its own special
      parsing monad[34], GenParser.
    
      We start by running many line.
      many is a function that takes a function as an
      argument.  It tries to repeatedly
      parse the input using the function passed to it.  It gathers up the
      results from all that repeated parsing and returns a list of them.  So,
      here, we are storing the results of parsing all lines in
      result.  Then we look for the end-of-file indicator,
      called eof.  Finally, we return the
      result.  So, a CSV file is made up of many lines,
      then the end of file.  We can often read out Parsec functions in plain
      English just like this.
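If you'd like to see many and eof in isolation, here is a small sketch, not part of csv1.hs, that uses the standard digit parser in the same style:

-- Not part of csv1.hs: many gathers the results of repeated parses into a
-- list, and eof then insists that no input is left over.
digitsOnly :: GenParser Char st String
digitsOnly =
    do ds <- many digit
       eof
       return ds
-- parse digitsOnly "(test)" "123"   gives Right "123"
-- parse digitsOnly "(test)" "123a"  fails, because eof rejects the 'a'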
    
      Now we must answer the question: what is a line?  We define the
      line function to do just that.  Reading the
      function, we can see that a line consists of cells followed by the end
      of line character.
    
      So what are cells?  We defined them in the cells
      function.  The cells of a line start with the content of the first
      cell, then continue with the content of the remaining cells, if any.
      The result is simply the first cell and the remaining cells assembled
      into a list.
    
Let's skip over remainingCells for a minute and
      look at cellContent.  A cell contains any number of
      characters, but each character must not be a comma or end of line
      character.  The noneOf function matches one item, so
      long as it isn't in the list of items that we pass.  So, saying
      many (noneOf ",\n") defines a cell the way we want
      it.
    
      Back in remainingCells, we have the first example of
      a choice in Parsec.  The choice operator is <|>.  This operator
      behaves like this: it will first try the parser on the left.  If that
      parser fails without consuming any input[35], it will then try the parser on the right.
    
      So, in remainingCells, our task is to come up with
      all the cells after the first.  Recall that
      cellContent uses noneOf ",\n".
      So it will not consume the comma or end-of-line character from the
      input.  If we see a comma after parsing a cell, it means that at least
      one more cell follows.  Otherwise, we're done.  So, our first choice in
      remainingCells is char ','.  This
      parser simply matches the passed character in the input.  If we found a
      comma, we want this function to return the remaining cells on the line.
      At this point, the "remaining cells" looks exactly like the start of
      the line, so we call cells recursively to parse
      them.  If we didn't find a comma, we return the empty list, signifying
      no remaining cells on the line.
    
      Finally, we must define what the end-of-line indicator is.  We set it
      to char '\n', which will suit our purposes fine for
      now.
    
      At the very end of the program, we define a function
      parseCSV that takes a String and parses it as a
      CSV file.  This function is just a shortcut that calls Parsec's
      parse function, filling in a few parameters.
      parse returns Either ParseError
        [[String]] for the CSV file.  If there was an error, the
      return value will be Left with the error; otherwise, it will be
      Right with the result.
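Since callers must inspect that Either themselves, here is one way we might consume the result; countRecords is a hypothetical helper, not part of csv1.hs:

-- A sketch, not part of csv1.hs: pattern matching on parseCSV's result.
countRecords :: String -> IO ()
countRecords input =
    case parseCSV input of
      Left err   -> putStrLn ("Parse failed: " ++ show err)
      Right rows -> putStrLn (show (length rows) ++ " record(s) parsed")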
    
Now that we understand this code, let's play with it a bit and see what it does.
ghci> :l csv1.hs
[1 of 1] Compiling Main ( csv1.hs, interpreted )
Ok, modules loaded: Main.
ghci> parseCSV ""
Loading package parsec-2.1.0.0 ... linking ... done.
Right []
That makes sense: parsing the empty string returns an empty list. Let's try parsing a single cell.
ghci> parseCSV "hi"
Left "(unknown)" (line 1, column 3):
unexpected end of input
expecting "," or "\n"
Look at that. Recall how we defined that each line must end with the end-of-line character, and we didn't give it. Parsec's error message helpfully indicated the line number and column number of the problem, and even told us what it was expecting! Let's give it an end-of-line character and continue experimenting.
ghci> parseCSV "hi\n"
Right [["hi"]]
ghci> parseCSV "line1\nline2\nline3\n"
Right [["line1"],["line2"],["line3"]]
ghci> parseCSV "cell1,cell2,cell3\n"
Right [["cell1","cell2","cell3"]]
ghci> parseCSV "l1c1,l1c2\nl2c1,l2c2\n"
Right [["l1c1","l1c2"],["l2c1","l2c2"]]
ghci> parseCSV "Hi,\n\n,Hello\n"
Right [["Hi",""],[""],["","Hello"]]
      You can see that parseCSV is doing exactly what we
      wanted it to do.  It's even handling empty cells and empty lines
      properly.
    
We promised you earlier that we could simplify our CSV parser significantly by using a few Parsec helper functions. There are two that will dramatically simplify this code.
      The first tool is the sepBy function.  This
      function takes two functions as arguments: the first function
      parses some sort of content, while the second function parses a
      separator.  sepBy starts by trying to parse
      content, then separators, and alternates back and forth until it
      can't parse a separator.  It returns a list of all the content
      that it was able to parse.
    
      The second tool is endBy.  It's similar to
      sepBy, but expects the very last item to be
      followed by the separator.  That is, it continues parsing until
      it can't parse any more content.
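Here is a quick sketch, not from the book, contrasting the two; it uses many1, which behaves like many but requires at least one match and is covered later in this chapter:

-- A sketch contrasting the two combinators on simple input.
numbersSep, numbersEnd :: GenParser Char st [String]
numbersSep = sepBy (many1 digit) (char ',')   -- parsing "1,20,3"  gives Right ["1","20","3"]
numbersEnd = endBy (many1 digit) (char ';')   -- parsing "1;20;3;" gives Right ["1","20","3"]
-- sepBy expects no separator after the last item; endBy requires one.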
    
      So, we can use endBy to parse lines, since every
      line must end with the end-of-line character.  We can use
      sepBy to parse cells, since the last cell will not
      end with a comma.  Take a look at how much simpler our parser is now:
    
-- file: ch16/csv2.hs
import Text.ParserCombinators.Parsec
csvFile = endBy line eol
line = sepBy cell (char ',')
cell = many (noneOf ",\n")
eol = char '\n'
parseCSV :: String -> Either ParseError [[String]]
parseCSV input = parse csvFile "(unknown)" input
This program behaves exactly the same as the first one. We can verify this by using ghci to re-run our earlier examples; we'll get the same result from every one. Yet the program is much shorter and more readable. It won't be long before you can translate Parsec code like this into a file format definition in plain English. As you read over this code, you can see that a CSV file contains zero or more lines, each terminated by the end-of-line character; a line contains one or more cells, separated by commas; a cell contains zero or more characters, none of which may be a comma or the end-of-line character; and the end-of-line character is the newline, \n.
      Different operating systems use different characters to mark the
      end-of-line.  Unix/Linux systems, plus Windows in text mode, use
      simply "\n".
      DOS and Windows systems use "\r\n", and Macs
      traditionally used "\r".  We could add in
      support for "\n\r" too, just in case anybody
      uses that.
    
      We could easily adapt our example to be able to handle all these types
      of line endings in a single file.  We would need to make two
      modifications: adjust eol to recognize the different
      endings, and adjust the noneOf pattern in
      cell to ignore \r.
    
      This must be done carefully.  Recall that our earlier definition of
      eol was simply char '\n'.  There
      is a parser called string that we can use to match
      the multi-character patterns.  Let's start by thinking of how we would
      add support for \n\r.
    
Our first attempt might look like this:
-- file: ch16/csv3.hs
-- This function is not correct!
eol = string "\n" <|> string "\n\r"
      This isn't quite right.  Recall that the <|> operator always
      tries the left alternative first.  Looking for the single character
      \n will match both types of line endings, so for a
      \n\r ending the parser stops after the \n, making it
      look as though the following line begins with \r.  Not what we want.  Try it in ghci:
    
ghci> :m Text.ParserCombinators.Parsec
ghci> let eol = string "\n" <|> string "\n\r"
Loading package parsec-2.1.0.0 ... linking ... done.
ghci> parse eol "" "\n"
Right "\n"
ghci> parse eol "" "\n\r"
Right "\n"
It may seem like the parser worked for both endings, but actually looking at it this way, we can't tell. If it left something un-parsed, we don't know, because we're not trying to consume anything else from the input. So let's look for the end-of-file after our end of line:
ghci> parse (eol >> eof) "" "\n\r"
Left (line 2, column 1):
unexpected "\r"
expecting end of input
ghci> parse (eol >> eof) "" "\n"
Right ()
      As expected, we got an error from the \n\r ending.
      So the next temptation may
      be to try it this way:
    
-- file: ch16/csv4.hs
-- This function is not correct!
eol = string "\n\r" <|> string "\n"
      This also isn't right.  Recall that <|> only attempts the option
      on the right if the option on the left failed without consuming any input.  But by the
      time we are able to see if there is a \r after the
      \n, we've already consumed the
      \n.  This time, we fail on the other case in ghci:
    
ghci> :m Text.ParserCombinators.Parsec
ghci> let eol = string "\n\r" <|> string "\n"
Loading package parsec-2.1.0.0 ... linking ... done.
ghci> parse (eol >> eof) "" "\n\r"
Right ()
ghci> parse (eol >> eof) "" "\n"
Left (line 1, column 1):
unexpected end of input
expecting "\n\r"
      We've stumbled upon the lookahead problem.  It turns out that, when
      writing parsers, it's often very convenient to be able to "look ahead"
      at the data that's coming in.  Parsec supports this, but before showing
      you how to use it, let's see how you would have to write this to get
      along without it.  You'd have to manually expand all the options after
      the \n like this:
    
-- file: ch16/csv5.hs
eol = 
    do char '\n'
       char '\r' <|> return '\n'
      This function first looks for \n.  If it is found,
      then it will look for \r, consuming it if possible.
      Since the return type of char '\r' is a
      Char, the alternative action is to simply return a
      Char without attempting to parse anything.  Parsec
      has a function option that can also express this
      idiom as option '\n' (char '\r').  Let's test this
      with ghci.
    
ghci> :l csv5.hs
[1 of 1] Compiling Main ( csv5.hs, interpreted )
Ok, modules loaded: Main.
ghci> parse (eol >> eof) "" "\n\r"
Loading package parsec-2.1.0.0 ... linking ... done.
Right ()
ghci> parse (eol >> eof) "" "\n"
Right ()
This time, we got the right result! But we could have done it easier with Parsec's lookahead support.
        Parsec has a function called try that is used to express
        lookaheads.  try takes a single argument: a parser.  It applies that
        parser.  If the parser doesn't succeed, try behaves as if it hadn't
        consumed any input at all.  So, when you use try on the left side
        of <|>, Parsec will try the option on the right even if the
        left side failed after consuming some input.  try only has an
        effect if it is on the left of a <|>.  Keep in mind, though,
        that many functions use
        <|> internally.  Here's a way to add
        expanded end-of-line support to our CSV parser using try:
      
-- file: ch16/csv6.hs
import Text.ParserCombinators.Parsec
csvFile = endBy line eol
line = sepBy cell (char ',')
cell = many (noneOf ",\n\r")
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
parseCSV :: String -> Either ParseError [[String]]
parseCSV input = parse csvFile "(unknown)" input
        Here we put both of the two-character endings first, and run both
        tests under try.  Both of them occur to the left of a <|>, so
        they will do the right thing.  We could have put string
          "\n" within a try as well, but it wouldn't have altered any
        behavior, since those parsers examine only one character anyway.  We can load
        this up and test the eol function in ghci.
      
ghci> :l csv6.hs
[1 of 1] Compiling Main ( csv6.hs, interpreted )
Ok, modules loaded: Main.
ghci> parse (eol >> eof) "" "\n\r"
Loading package parsec-2.1.0.0 ... linking ... done.
Right ()
ghci> parse (eol >> eof) "" "\n"
Right ()
ghci> parse (eol >> eof) "" "\r\n"
Right ()
ghci> parse (eol >> eof) "" "\r"
Right ()
All four endings were handled properly. You can also test the full CSV parser with some different endings like this:
ghci>parseCSV "line1\r\nline2\nline3\n\rline4\rline5\n"Right [["line1"],["line2"],["line3"],["line4"],["line5"]]
As you can see, this program even supports different line endings within a single file.
At the beginning of this chapter, you saw how Parsec could generate error messages that list the location where the error occurred as well as what was expected. As parsers get more complex, the list of what was expected can become cumbersome. Parsec provides a way for you to specify custom error messages in the event of parse failures.
Let's look at what happens when our current CSV parser encounters an error:
ghci>parseCSV "line1"Left "(unknown)" (line 1, column 6): unexpected end of input expecting ",", "\n\r", "\r\n", "\n" or "\r"
        That's a pretty long, and technical, error message.  We could make an
        attempt to resolve this by using the monad fail function like so:
      
-- file: ch16/csv7.hs
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <|> fail "Couldn't find EOL"
Under ghci, we can see the result:
ghci> :l csv7.hs
[1 of 1] Compiling Main ( csv7.hs, interpreted )
Ok, modules loaded: Main.
ghci> parseCSV "line1"
Loading package parsec-2.1.0.0 ... linking ... done.
Left "(unknown)" (line 1, column 6):
unexpected end of input
expecting ",", "\n\r", "\r\n", "\n" or "\r"
Couldn't find EOL
        We added to the error result, but didn't really help clean up the
        output.  Parsec has an <?> operator that is designed for just these
        situations.  It is similar to <|> in that it first tries the
        parser on its left.  Instead of trying another parser in the event of
        a failure, it presents an error message.  Here's how we'd use it:
      
-- file: ch16/csv8.hs
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"Now, when you generate an error, you'll get more helpful output:
ghci> :l csv8.hs
[1 of 1] Compiling Main ( csv8.hs, interpreted )
Ok, modules loaded: Main.
ghci> parseCSV "line1"
Loading package parsec-2.1.0.0 ... linking ... done.
Left "(unknown)" (line 1, column 6):
unexpected end of input
expecting "," or end of line
        That's pretty helpful!  The general rule of thumb is that you put a
        human description of what you're looking for to the right of
        <?>.
      
Our earlier CSV examples had an important flaw: they weren't able to handle cells that contain a comma. CSV-generating programs typically put quotation marks around such data. But that creates another problem: what to do if a cell itself contains a quotation mark. In these cases, the embedded quotation marks are doubled up.
Here is a full CSV parser. You can use this from ghci, or if you compile it to a standalone program, it will parse a CSV file on standard input and convert it to a different format on output.
-- file: ch16/csv9.hs
import Text.ParserCombinators.Parsec
csvFile = endBy line eol
line = sepBy cell (char ',')
cell = quotedCell <|> many (noneOf ",\n\r")
quotedCell = 
    do char '"'
       content <- many quotedChar
       char '"' <?> "quote at end of cell"
       return content
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return '"')
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"
parseCSV :: String -> Either ParseError [[String]]
parseCSV input = parse csvFile "(unknown)" input
main =
    do c <- getContents
       case parse csvFile "(stdin)" c of
            Left e -> do putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r
      That's a full-featured CSV parser in just 21 lines of code, plus an
      additional 10 lines for the parseCSV and
      main utility functions.
    
      Let's look at the changes in this program from the previous versions.
      First, a cell may now be either a bare cell or a "quoted" cell.  We
      give the quotedCell option first, because we want to
      follow that path if the first character in a cell is the quote mark.
    
      The quotedCell begins and ends with a quote mark,
      and contains zero or more characters.  These characters can't be copied
      directly, though, because they may contain embedded, doubled-up, quote
      marks themselves.  So we define a custom quotedChar
      to process them.
    
      When we're processing characters inside a quoted cell, we first say
      noneOf "\"".  This will match and return any single
      character as long as it's not the quote mark.  Otherwise, if it is the
      quote mark, we see if we have two of them in a row.  If so, we return a
      single quote mark to go on our result string.
    
      Notice the try in quotedChar, which appears on the
      right side of <|>.  Recall that we said that
      try only has an effect if it is on the left side of <|>.  This
      try does in fact occur on the left side of a <|>, but of
      one hidden within the implementation of many.
    
      This try is important.  Let's say we are parsing a
      quoted cell, and are getting towards the end of it.  There will be
      another cell following.  So we will expect to see a quote to end the
      current cell, followed by a comma.  When we hit
      quotedChar, we will fail the
      noneOf test and proceed to the test that looks for
      two quotes in a row.  We'll also fail that one because we'll have a
      quote, then a comma.  If we hadn't used try, we'd crash with an
      error at this point, saying that it was expecting the second quote,
      because the first quote was already consumed.  Since we use try, this
      is properly recognized as not a character that's part of the cell, so
      it terminates the many quotedChar expression as
      expected.  Lookahead has once again proven very useful, and the fact
      that it is so easy to add makes it a remarkable tool in Parsec.
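If you would like to see that failure for yourself, one quick experiment, not in the book, is to use a variant of quotedChar without the try:

-- A sketch, not from the book: quotedChar with the try removed.
badQuotedChar =
        noneOf "\""
    <|> (string "\"\"" >> return '"')
-- Substituted into csv9.hs, parsing "\"a\",\"b\"\n" now fails: the closing
-- quote of the first cell is consumed by string "\"\"", which then chokes on
-- the comma, so many aborts with an error instead of ending the cell.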
    
We can test this program with ghci over some quoted cells.
ghci> :l csv9.hs
[1 of 1] Compiling Main ( csv9.hs, interpreted )
Ok, modules loaded: Main.
ghci> parseCSV "\"This, is, one, big, cell\"\n"
Loading package parsec-2.1.0.0 ... linking ... done.
Right [["This, is, one, big, cell"]]
ghci> parseCSV "\"Cell without an end\n"
Left "(unknown)" (line 2, column 1):
unexpected end of input
expecting "\"\"" or quote at end of cell
Let's run it over a real CSV file. Here's one generated by a spreadsheet program:
"Product","Price"
"O'Reilly Socks",10
"Shirt with ""Haskell"" text",20
"Shirt, ""O'Reilly"" version",20
"Haskell Caps",15
    Now, we can run this under our test program and watch:
$ runhaskell csv9.hs < test.csv
["Product","Price"]
["O'Reilly Socks","10"]
["Shirt with \"Haskell\" text","20"]
["Shirt, \"O'Reilly\" version","20"]
["Haskell Caps","15"]
    Parsec's GenParser monad is an
      instance of the MonadPlus typeclass that we
      introduced in
	the section called “Looking for alternatives”.  The value
      mzero represents a parse failure, while
      mplus combines two alternative parses into
      one, using (<|>).
-- file: ch16/ParsecPlus.hs
instance MonadPlus (GenParser tok st) where
    mzero = fail "mzero"
    mplus = (<|>)
When we introduced
      application/x-www-form-urlencoded text in the section called “Golfing practice: association lists”, we mentioned that we'd write a
      parser for these strings.  We can quickly and easily do this
      using Parsec.
Each key-value pair is separated by the
      & character.
-- file: ch16/FormParse.hs
p_query :: CharParser () [(String, Maybe String)]
p_query = p_pair `sepBy` char '&'
Notice that in the type signature, we're using Maybe to represent a value: the HTTP specification is unclear about whether a key must have an associated value, and we'd like to be able to distinguish between “no value” and “empty value”.
-- file: ch16/FormParse.hs
p_pair :: CharParser () (String, Maybe String)
p_pair = do
  name <- many1 p_char
  value <- optionMaybe (char '=' >> many p_char)
  return (name, value)
The many1 function is similar to
      many: it applies its parser repeatedly,
      returning a list of their results.  While
      many will succeed and return an empty list
      if its parser never succeeds, many1 will
      fail if its parser never succeeds, and will otherwise return a
      list of at least one element.
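As a quick sketch, not from the book, of the difference:

-- many succeeds on zero matches; many1 does not.
zeroOrMore, oneOrMore :: CharParser () String
zeroOrMore = many digit    -- on "abc": succeeds, returning ""
oneOrMore  = many1 digit   -- on "abc": fails; on "42x": returns "42"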
The optionMaybe function modifies the
      behaviour of a parser.  If the parser fails,
      optionMaybe doesn't fail: it returns
      Nothing.  Otherwise, it wraps the parser's
      successful result with Just.  This gives us the
      ability to distinguish between “no value” and
      “empty value”, as we mentioned above.
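Concretely, with the p_pair parser above we would expect the following; these results are a sketch of expected behaviour, not output from the book:

-- parseTest p_pair "a="   should print ("a",Just "")   -- empty value
-- parseTest p_pair "a"    should print ("a",Nothing)   -- no value at all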
Individual characters can be encoded in one of several ways.
-- file: ch16/FormParse.hs
p_char :: CharParser () Char
p_char = oneOf urlBaseChars
     <|> (char '+' >> return ' ')
     <|> p_hex
urlBaseChars = ['a'..'z']++['A'..'Z']++['0'..'9']++"$-_.!*'(),"
p_hex :: CharParser () Char
p_hex = do
  char '%'
  a <- hexDigit
  b <- hexDigit
  let ((d, _):_) = readHex [a,b]
  return . toEnum $ d
Some characters can be represented literally.  Spaces are
      treated specially, using a + character.  Other
      characters must be encoded as a % character
      followed by two hexadecimal digits.  The Numeric
      module's readHex parses a hex string as
      a number.
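For instance, here is a quick check of the pieces p_hex relies on; this sketch is not from the book and assumes readHex from Numeric is in scope, as it is in FormParse.hs:

-- decodeHexPair mirrors the last two lines of p_hex as a pure function.
decodeHexPair :: Char -> Char -> Char
decodeHexPair a b = toEnum . fst . head . readHex $ [a, b]
-- decodeHexPair '2' '1' evaluates to '!', which is why the "%21" in the
-- session below decodes to '!'.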
ghci> parseTest p_query "foo=bar&a%21=b+c"
Loading package parsec-2.1.0.0 ... linking ... done.
[("foo",Just "bar"),("a!",Just "b c")]
As appealing and readable as this parser is, we can profit from stepping back and taking another look at some of our building blocks.
In many popular languages, people tend to put regular expressions to work for “casual” parsing. They're notoriously tricky for this purpose: hard to write, difficult to debug, nearly incomprehensible after a few months of neglect, and provide no error messages on failure.
If we can write compact Parsec parsers, we'll gain in readability, expressiveness, and error reporting. Our parsers won't be as short as regular expressions, but they'll be close enough to negate much of the temptation of regexps.
A few of our parsers above use do notation and bind the
      result of an intermediate parse to a variable, for later use.
      One such function is p_pair.
-- file: ch16/FormParse.hs
p_pair :: CharParser () (String, Maybe String)
p_pair = do
  name <- many1 p_char
  value <- optionMaybe (char '=' >> many p_char)
  return (name, value)
We can get rid of the need for explicit variables by using
      the liftM2 combinator from
      Control.Monad.
-- file: ch16/FormParse.hs
p_pair_app1 =
    liftM2 (,) (many1 p_char) (optionMaybe (char '=' >> many p_char))
This parser has exactly the same type and behaviour as
      p_pair, but it's one line long.  Instead of
      writing our parser in a “procedural” style, we've
      simply switched to a programming style that emphasises that
      we're applying parsers and
      combining their results.
We can take this applicative style of writing a parser much further. In most cases, the extra compactness that we will gain will not come at any cost in readability, beyond the initial effort of coming to grips with the idea.
The standard Haskell libraries include a module named
      Control.Applicative, which we already encountered
      in the section called “Infix use of fmap”.  This module defines a
      typeclass named Applicative, which represents an
      applicative functor.  This is a little bit
      more structured than a functor, but a little bit less than a
      monad.  It also defines Alternative, which is
      similar to MonadPlus.
As usual, we think that the best way to introduce applicative functors is by putting them to work. In theory, every monad is an applicative functor, but not every applicative functor is a monad. Because applicative functors were added to the standard Haskell libraries long after monads, we often don't get an Applicative instance for free: frequently, we have to declare the monad we're using to be Applicative or Alternative.
To do this for Parsec, we'll write a small
      module that we can import instead of the normal
      Parsec module.
-- file: ch16/ApplicativeParsec.hs
module ApplicativeParsec
    (
      module Control.Applicative
    , module Text.ParserCombinators.Parsec
    ) where
import Control.Applicative
import Control.Monad (MonadPlus(..), ap)
-- Hide a few names that are provided by Applicative.
import Text.ParserCombinators.Parsec hiding (many, optional, (<|>))
-- The Applicative instance for every Monad looks like this.
instance Applicative (GenParser s a) where
    pure  = return
    (<*>) = ap
-- The Alternative instance for every MonadPlus looks like this.
instance Alternative (GenParser s a) where
    empty = mzero
    (<|>) = mplus
For convenience, our module's export section
      exports all the names we imported from both the
      Applicative and Parsec modules.
      Because we hid Parsec's version of
      (<|>) when importing, the one that
      will be exported is from Control.Applicative, as we
      would like.
We'll begin by rewriting our existing form parser from the
      bottom up, beginning with p_hex, which
      parses a hexadecimal escape sequence.  Here's the code in normal
      do-notation style.
-- file: ch16/FormApp.hs
p_hex :: CharParser () Char
p_hex = do
  char '%'
  a <- hexDigit
  b <- hexDigit
  let ((d, _):_) = readHex [a,b]
  return . toEnum $ d
And here's our applicative version.
-- file: ch16/FormApp.hs
a_hex = hexify <$> (char '%' *> hexDigit) <*> hexDigit
    where hexify a b = toEnum . fst . head . readHex $ [a,b]
Although the individual parsers are mostly untouched, the
      combinators that we're gluing them together with have changed.
      The only familiar one is (<$>), which
      we already know is a synonym for
      fmap.
From our definition of Applicative, we know
      that (<*>) is
      ap.
The remaining unfamiliar combinator is
      (*>), which applies its first argument,
      throws away its result, then applies the second and returns its
      result.  In other words, it's similar to
      (>>).
Although the concepts here should mostly be familiar from
      our earlier coverage of functors and monads, we'll walk through
      this function to explain what's happening.  First, to get a grip
      on our types, we'll hoist hexify to the top
      level and give it a signature.
-- file: ch16/FormApp.hs
hexify :: Char -> Char -> Char
hexify a b = toEnum . fst . head . readHex $ [a,b]
Parsec's hexDigit parser parses a
      single hexadecimal digit.
ghci> :type hexDigit
hexDigit :: CharParser st Char
Therefore, char '%' *> hexDigit has the same
      type, since (*>) returns the result on
      its right.  (The CharParser type is nothing
      more than a synonym for GenParser Char.)
ghci> :type char '%' *> hexDigit
char '%' *> hexDigit :: GenParser Char st Char
The expression hexify <$> (char '%' *>
	hexDigit) is a parser that matches a “%”
      character followed by a hex digit, and whose result is a
      function.
ghci> :type hexify <$> (char '%' *> hexDigit)
hexify <$> (char '%' *> hexDigit) :: GenParser Char st (Char -> Char)
Finally, (<*>) applies the parser
      on its left, then the parser on its right, and applies the
      function that's the result of the left parse to the value that's
      the result of the right.
If you've been able to follow this, then you understand the
      (<*>) and ap
      combinators: (<*>) is plain old
      ($) lifted to applicative functors, and
      ap is the same thing lifted to monads.
ghci> :type ($)
($) :: (a -> b) -> a -> b
ghci> :type (<*>)
(<*>) :: (Applicative f) => f (a -> b) -> f a -> f b
ghci> :type ap
ap :: (Monad m) => m (a -> b) -> m a -> m b
Next, we'll consider the p_char
      parser.
-- file: ch16/FormApp.hs
p_char :: CharParser () Char
p_char = oneOf urlBaseChars
     <|> (char '+' >> return ' ')
     <|> p_hex
urlBaseChars = ['a'..'z']++['A'..'Z']++['0'..'9']++"$-_.!*'(),"
This remains almost the same in an applicative style, save for one piece of convenient notation.
-- file: ch16/FormApp.hs
a_char = oneOf urlBaseChars
     <|> (' ' <$ char '+')
     <|> a_hex
Here, the (<$) combinator uses the
      value on the left if the parser on the right succeeds.
Finally, the equivalent of p_pair_app1 is
      almost identical.
-- file: ch16/FormParse.hs
p_pair_app1 =
    liftM2 (,) (many1 p_char) (optionMaybe (char '=' >> many p_char))
All we've changed is the combinator we use for lifting: the
      liftA functions act in the same ways as
      their liftM cousins.
-- file: ch16/FormApp.hs
a_pair :: CharParser () (String, Maybe String)
a_pair = liftA2 (,) (many1 a_char) (optionMaybe (char '=' *> many a_char))
To give ourselves a better feel for parsing with applicative functors, and to explore a few more corners of Parsec, we'll write a JSON parser that follows the definition in RFC 4627.
At the top level, a JSON value must be either an object or an array.
-- file: ch16/JSONParsec.hs
p_text :: CharParser () JValue
p_text = spaces *> text
     <?> "JSON text"
    where text = JObject <$> p_object
             <|> JArray <$> p_array
These are structurally similar, with an opening character, followed by zero or more items separated by commas, followed by a closing character. We capture this similarity by writing a small helper function.
-- file: ch16/JSONParsec.hs
p_series :: Char -> CharParser () a -> Char -> CharParser () [a]
p_series left parser right =
    between (char left <* spaces) (char right) $
            (parser <* spaces) `sepBy` (char ',' <* spaces)
Here, we finally have a use for the
      (<*) combinator that we introduced
      earlier.  We use it to skip over any white space that might
      follow certain tokens.  With this p_series
      function, parsing an array is simple.
-- file: ch16/JSONParsec.hs
p_array :: CharParser () (JAry JValue)
p_array = JAry <$> p_series '[' p_value ']'
Dealing with a JSON object is hardly more complicated, requiring just a little additional effort to produce a name/value pair for each of the object's fields.
-- file: ch16/JSONParsec.hs
p_object :: CharParser () (JObj JValue)
p_object = JObj <$> p_series '{' p_field '}'
    where p_field = (,) <$> (p_string <* char ':' <* spaces) <*> p_value
Parsing an individual value is a matter of calling an existing parser, then wrapping its result with the appropriate JValue constructor.
-- file: ch16/JSONParsec.hs
p_value :: CharParser () JValue
p_value = value <* spaces
  where value = JString <$> p_string
            <|> JNumber <$> p_number
            <|> JObject <$> p_object
            <|> JArray <$> p_array
            <|> JBool <$> p_bool
            <|> JNull <$ string "null"
            <?> "JSON value"
p_bool :: CharParser () Bool
p_bool = True <$ string "true"
     <|> False <$ string "false"
The choice combinator allows us to
      represent this kind of ladder-of-alternatives as a list.  It
      returns the result of the first parser to succeed.
-- file: ch16/JSONParsec.hs
p_value_choice = value <* spaces
  where value = choice [ JString <$> p_string
                       , JNumber <$> p_number
                       , JObject <$> p_object
                       , JArray <$> p_array
                       , JBool <$> p_bool
                       , JNull <$ string "null"
                       ]
                <?> "JSON value"This leads us to the two most interesting parsers, for numbers and strings. We'll deal with numbers first, since they're simpler.
-- file: ch16/JSONParsec.hs
p_number :: CharParser () Double
p_number = do s <- getInput
              case readSigned readFloat s of
                [(n, s')] -> n <$ setInput s'
                _         -> empty
Our trick here is to take advantage of Haskell's standard
      number parsing library functions, which are defined in the
      Numeric module.  The readFloat
      function reads an unsigned floating point number, and
      readSigned takes a parser for an unsigned
      number and turns it into a parser for possibly signed
      numbers.
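For example, a quick check, not from the book, of what those functions return:

-- readSigned readFloat "-3.25,17]" :: [(Double, String)]
--     evaluates to [(-3.25,",17]")]
-- The unparsed remainder is exactly what p_number hands back to Parsec,
-- as described next.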
Since these functions know nothing about Parsec, we have to
      work with them specially.  Parsec's
      getInput function gives us direct access to
      Parsec's unconsumed input stream.  If readSigned
	readFloat succeeds, it returns both the parsed number
      and the rest of the unparsed input.  We then use
      setInput to give this back to Parsec as its
      new unconsumed input stream.
Parsing a string isn't difficult, merely detailed.
-- file: ch16/JSONParsec.hs
p_string :: CharParser () String
p_string = between (char '\"') (char '\"') (many jchar)
    where jchar = char '\\' *> (p_escape <|> p_unicode)
              <|> satisfy (`notElem` "\"\\")
We can parse and decode an escape sequence with
      the help of the choice combinator that we
      just met.
-- file: ch16/JSONParsec.hs
p_escape = choice (zipWith decode "bnfrt\\\"/" "\b\n\f\r\t\\\"/")
    where decode c r = r <$ char c
Finally, JSON lets us encode a Unicode character in
      a string as “\u” followed by four
      hexadecimal digits.
-- file: ch16/JSONParsec.hs
p_unicode :: CharParser () Char
p_unicode = char 'u' *> (decode <$> count 4 hexDigit)
    where decode x = toEnum code
              where ((code,_):_) = readHex x
The only piece of functionality that applicative functors are missing, compared to monads, is the ability to bind a value to a variable, which we need here in order to be able to validate the value we're trying to decode.
This is the one place in our parser that we've needed to use a monadic function. This pattern extends to more complicated parsers, too: only infrequently do we need the extra bit of power that monads offer.
As we write this book, applicative functors are still quite new to Haskell, and people are only beginning to explore the possible uses for them beyond the realm of parsing.
As another example of applicative parsing, we will develop a basic parser for HTTP requests.
-- file: ch16/HttpRequestParser.hs
module HttpRequestParser
    (
      HttpRequest(..)
    , Method(..)
    , p_request
    , p_query
    ) where
import ApplicativeParsec
import Numeric (readHex)
import Control.Monad (liftM4)
import System.IO (Handle)
An HTTP request consists of a method, an identifier, a
      series of headers, and an optional body.  For simplicity, we'll
      focus on just two of the six method types specified by the HTTP
      1.1 standard.  A POST method has a body; a
      GET has none.
-- file: ch16/HttpRequestParser.hs
data Method = Get | Post
          deriving (Eq, Ord, Show)
data HttpRequest = HttpRequest {
      reqMethod :: Method
    , reqURL :: String
    , reqHeaders :: [(String, String)]
    , reqBody :: Maybe String
    } deriving (Eq, Show)
Because we're writing in an applicative style, our parser can be both brief and readable. Readable, that is, if you're becoming used to the applicative parsing notation.
-- file: ch16/HttpRequestParser.hs
p_request :: CharParser () HttpRequest
p_request = q "GET" Get (pure Nothing)
        <|> q "POST" Post (Just <$> many anyChar)
  where q name ctor body = liftM4 HttpRequest req url p_headers body
            where req = ctor <$ string name <* char ' '
        url = optional (char '/') *>
              manyTill notEOL (try $ string " HTTP/1." <* oneOf "01")
              <* crlf
Briefly, the q helper
      function accepts a method name, the type constructor to apply to
      it, and a parser for a request's optional body.  The
      url helper does not attempt to validate a
      URL, because the HTTP specification does not specify what
      characters a URL may contain.  The function just consumes input
      until either the line ends or it reaches an HTTP version
      identifier.
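To get a feel for it, here is the kind of result we would expect from ghci; this is a sketch, not from the book, and the exact Show output may differ slightly:

-- parseTest p_request "GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
-- should print something like:
--   HttpRequest {reqMethod = Get, reqURL = "index.html",
--                reqHeaders = [("Host","example.com")], reqBody = Nothing}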
The try combinator has to hold onto input in
	case it needs to restore it, so that an alternative parser can
	be used.  This practice is referred to as
	backtracking. Because
	try must save input, it is expensive to
	use.  Sprinkling a parser with unnecessary uses of
	try is a very effective way to slow it
	down, sometimes to the point of unacceptable
	performance.
The standard way to avoid the need for backtracking is to tidy up a parser so that we can decide whether it will succeed or fail using only a single token of input. In this case, the two parsers consume the same initial tokens, so we turn them into a single parser.
ghci> let parser = (++) <$> string "HT" <*> (string "TP" <|> string "ML")
ghci> parseTest parser "HTTP"
"HTTP"
ghci> parseTest parser "HTML"
"HTML"
Even better, Parsec gives us an improved error message if we feed it non-matching input.
ghci>parseTest parser "HTXY"parse error at (line 1, column 3): unexpected "X" expecting "TP" or "ML"
Following the first line of a HTTP request is a series of zero or more headers. A header begins with a field name, followed by a colon, followed by the content. If the lines that follow begin with spaces, they are treated as continuations of the current content.
-- file: ch16/HttpRequestParser.hs
p_headers :: CharParser st [(String, String)]
p_headers = header `manyTill` crlf
  where header = liftA2 (,) fieldName (char ':' *> spaces *> contents)
        contents = liftA2 (++) (many1 notEOL <* crlf)
                               (continuation <|> pure [])
        continuation = liftA2 (:) (' ' <$ many1 (oneOf " \t")) contents
        fieldName = (:) <$> letter <*> many fieldChar
        fieldChar = letter <|> digit <|> oneOf "-_"
crlf :: CharParser st ()
crlf = (() <$ string "\r\n") <|> (() <$ newline)
notEOL :: CharParser st Char
notEOL = noneOf "\r\n"
Our HTTP request parser is too simple to be useful in real deployments. It is missing vital functionality, and is not resistant to even the most basic denial of service attacks.
1. Make the parser honour the …

2. A popular denial of service attack against naive web servers is simply to send unreasonably long headers. A single header might contain tens or hundreds of megabytes of garbage text, causing a server to run out of memory. Restructure the header parser so that it will fail if any line is longer than 4096 characters. It must fail immediately when this occurs; it cannot wait until the end of a line eventually shows up.

3. Add the ability to honour the …

4. Another popular attack is to open a connection and either leave it idle or send data extremely slowly. Write a wrapper in the IO monad that will invoke the parser. Use the …
[34] For more on monads, refer to Chapter 14, Monads
[35] For information on dealing with choices that may consume some input before failing, see the section called “Lookahead”.