README

Construct parser combinator functions for parsing character vectors

This R package contains tools to construct parser combinator functions: higher-order functions that parse input. The main goal of this package is to simplify the creation of transparent parsers for structured text files generated by machines such as laboratory instruments. Such files consist of lines of text organized in higher-order structures like headers with metadata and blocks of measured values. To read these data into R you first need to create a parser that processes these files and creates R objects as output. The parcr package simplifies the task of creating such parsers.

This package was inspired by the package “Ramble” by Chapman Siu and co-workers and by the paper “Higher-order functions for parsing” by Graham Hutton (1992).

Installation

Install the stable version from CRAN:

install.packages("parcr")

To install the development version including its vignette, run the following command (this assumes the remotes package is installed):

remotes::install_github("SystemsBioinformatics/parcr", build_vignettes = TRUE)

Example application: a parser for fasta sequence files

As an example of a realistic application we write a parser for fasta-formatted files for nucleotide and protein sequences. We use a few simplifying assumptions about this format for the sake of the example. Real fasta files are more complex than we pretend here.

Please note that more background about the functions that we use here is available in the package documentation. Here we only present a summary.

A fasta file with mixed sequence types could look like the example below:

>sequence_A
GGTAAGTCCTCTAGTACAAACACCCCCAAT
TCTGTTGCCAGAAAAAACACTTTTAGGCTA
>sequence_B
ATTGTGATATAATTAAAATTATATTCATAT
TATTAGAGCCATCTTCTTTGAAGCGTTGTC
TATGCATCGATC

>sequence_C
MTEITAAMVKELRESTGAGMMDCKNALSET
NGDFDKAVQLLREKGLGKAAKKADRLAAEG
ENEYKALVAELEKE

Since fasta files are text files, we could read such a file into a character vector using readLines(). The package provides the data set fastafile, which contains that character vector.

data("fastafile")
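For a file on disk, the same kind of character vector can be obtained with readLines(). The sketch below first writes a small fasta fragment to a temporary file, so the file name and its contents are illustrative assumptions and not part of the package.

```r
# Hypothetical example file: write a small fasta fragment to a temporary file
fasta_path <- tempfile(fileext = ".fasta")
writeLines(
  c(">sequence_A",
    "GGTAAGTCCTCTAGTACAAACACCCCCAAT",
    "TCTGTTGCCAGAAAAAACACTTTTAGGCTA"),
  fasta_path
)

# readLines() returns a character vector with one element per line of the file
fasta_lines <- readLines(fasta_path)
length(fasta_lines)
```

A vector like fasta_lines has the same form as fastafile and could be passed to a parser in the same way.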

We can distinguish the following higher-order components in a fasta file:

A fasta file consists of one or more sequence blocks until the end of the file.
A sequence block consists of a header and a nucleotide sequence or a protein sequence. A sequence block could be preceded by zero or more empty lines.
A nucleotide sequence consists of one or more nucleotide sequence strings.
A protein sequence consists of one or more protein sequence strings.
A header is a string that starts with a “>” immediately followed by a title without spaces.
A nucleotide sequence string is a string without spaces that consists entirely of symbols from the set {G,A,T,C}.
A protein sequence string is a string without spaces that consists entirely of symbols from the set {A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}.

It now becomes clear what we mean when we say that the package allows us to write transparent parsers: the description above of the structure of fasta files can be put straight into code for a Fasta() parser:

Fasta <- function() {
  one_or_more(SequenceBlock()) %then%
    eof()
}

SequenceBlock <- function() {
  MaybeEmpty() %then% 
    Header() %then% 
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using% 
    function(x) list(type = "Nucl", sequence = paste(x, collapse=""))
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using% 
    function(x) list(type = "Prot", sequence = paste(x, collapse=""))
}

Functions like one_or_more(), %then%, %or%, %using%, eof() and MaybeEmpty() are defined in the package and are the basic parsers with which the package user can build complex parsers. The %using% operator applies the function on its right-hand side to the output of the parser on its left-hand side. Please see the vignette in the parcr package for an explanation of why this is useful, or even necessary.

Notice that the new parser functions that we define above are higher-order functions that take no arguments, hence the empty brackets () behind their names. Calling one of them returns the actual parser, which is then applied to the input.
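To make the idea of combining parsers concrete, here is a toy version in base R. It is a sketch of the general technique and not the implementation used by parcr: the names toy_match(), toy_then(), toy_using() and ToyRecord() are hypothetical, and returning NULL on failure is a simplifying assumption of this example.

```r
# Toy parser combinators in base R (hypothetical names, not the parcr code).
# A parser takes a character vector and returns list(L = parsed, R = remaining),
# or NULL on failure (an assumption made for this sketch only).

# Constructor: returns a parser that accepts one element matching `pattern`
toy_match <- function(pattern) {
  function(x) {
    if (length(x) > 0 && grepl(pattern, x[1])) {
      list(L = list(x[1]), R = x[-1])
    } else {
      NULL
    }
  }
}

# Sequence two parsers: run p1, then run p2 on what p1 left over
toy_then <- function(p1, p2) {
  function(x) {
    r1 <- p1(x)
    if (is.null(r1)) return(NULL)
    r2 <- p2(r1$R)
    if (is.null(r2)) return(NULL)
    list(L = c(r1$L, r2$L), R = r2$R)
  }
}

# Apply f to the output of parser p, leaving the remainder untouched
toy_using <- function(p, f) {
  function(x) {
    r <- p(x)
    if (is.null(r)) return(NULL)
    list(L = f(r$L), R = r$R)
  }
}

# A parser constructor without arguments, in the style used above
ToyRecord <- function() {
  toy_using(
    toy_then(toy_match("^>"), toy_match("^[GATC]+$")),
    function(x) list(title = sub("^>", "", x[[1]]), sequence = x[[2]])
  )
}

parser <- ToyRecord()   # calling the constructor returns a function
result <- parser(c(">seq_1", "GATTACA", "rest"))
str(result)
```

The constructor call ToyRecord() builds the parser, and only the second call consumes input. This mirrors the double-call pattern of Fasta()(fastafile).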

Now we need to define the parsers Header(), NuclSequenceString() and ProtSequenceString() that actually recognize and process the header line string and strings of nucleotide or protein sequences in the character vector fastafile. We use the function constructor stringparser() from the package to construct helper functions that recognize and capture the desired matches, and we use match_s() to create parcr-compliant parsers from these.

Header <- function() {
  match_s(stringparser("^>(\\w+)")) %using% 
    function(x) list(title = unlist(x))
}

NuclSequenceString <- function() {
  match_s(stringparser("^([GATC]+)$"))
}

ProtSequenceString <- function() {
  match_s(stringparser("^([ARNDBCEQZGHILKMFPSTWYV]+)$"))
}
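The regular expressions can be checked in isolation with base R before they are wrapped in parsers. The sketch below uses regexec(), regmatches() and grepl() instead of stringparser(), so it only shows what the patterns match and capture, not how the package functions behave. The variable names are chosen for this illustration.

```r
header_pattern <- "^>(\\w+)"
nucl_pattern   <- "^([GATC]+)$"
prot_pattern   <- "^([ARNDBCEQZGHILKMFPSTWYV]+)$"

# regexec() + regmatches() return the full match followed by the capture groups
header_match <- regmatches(">sequence_A", regexec(header_pattern, ">sequence_A"))[[1]]
header_match[2]   # the captured title, without the leading ">"

# A nucleotide line matches both patterns, because G, A, T and C
# also occur in the protein alphabet
grepl(nucl_pattern, "GGTAAGTCCTCTAGTACAAACACCCCCAAT")
grepl(prot_pattern, "GGTAAGTCCTCTAGTACAAACACCCCCAAT")

# A protein line containing other symbols does not match the nucleotide pattern
grepl(nucl_pattern, "MTEITAAMVKELRESTGAGMMDCKNALSET")
```

Because a pure G/A/T/C line satisfies both patterns, the order of the alternatives in SequenceBlock() matters if %or% tries its left-hand parser first. That is an assumption about %or% that is worth verifying in the package documentation.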

Now we have all the elements that we need to apply the Fasta() parser.

Fasta()(fastafile)
#> $L
#> $L[[1]]
#> $L[[1]]$title
#> [1] "sequence_A"
#> 
#> $L[[1]]$type
#> [1] "Nucl"
#> 
#> $L[[1]]$sequence
#> [1] "GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA"
#> 
#> 
#> $L[[2]]
#> $L[[2]]$title
#> [1] "sequence_B"
#> 
#> $L[[2]]$type
#> [1] "Nucl"
#> 
#> $L[[2]]$sequence
#> [1] "ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC"
#> 
#> 
#> $L[[3]]
#> $L[[3]]$title
#> [1] "sequence_C"
#> 
#> $L[[3]]$type
#> [1] "Prot"
#> 
#> $L[[3]]$sequence
#> [1] "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE"
#> 
#> 
#> 
#> $R
#> list()

The output of the parser consists of two elements, L and R, where L contains the parsed and processed part of the input and R the remaining unparsed part of the input. Since we explicitly required the parser to reach the end of the file by including eof() in the definition of the Fasta() parser, the R element contains an empty list to signal that the parser was indeed at the end of the input. Please see the package documentation for more examples and explanation.

Finally, let’s present the result of the parse more concisely using the names of the elements inside the L element:

d <- Fasta()(fastafile)[["L"]]
invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")}))
#> Nucl sequence_A GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA 
#> Nucl sequence_B ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC 
#> Prot sequence_C MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE
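Because every element of L is a list with the same names, the result can also be collected into a data frame for further analysis. The sketch below uses a hand-built stand-in, d_standin, with shortened sequences instead of the real parser output, so that it runs without the package. In practice the list d from above would take its place.

```r
# Hand-built stand-in with the same shape as the elements of L
# (hypothetical, shortened sequences)
d_standin <- list(
  list(title = "sequence_A", type = "Nucl", sequence = "GGTAAGTCC"),
  list(title = "sequence_C", type = "Prot", sequence = "MTEITAAMV")
)

# Turn each element into a one-row data frame and stack the rows
fasta_df <- do.call(
  rbind,
  lapply(d_standin, function(x) as.data.frame(x, stringsAsFactors = FALSE))
)

# Add the sequence length as an extra column
fasta_df$length <- nchar(fasta_df$sequence)
fasta_df
```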

Getting useful error messages when parsing

Basic error messaging is implemented in the function reporter(). You can wrap a parser in the reporter() function to obtain an error message that reports the line of the input in which the parser ultimately failed as well as some lines around it to provide context. Suppose we have the following badly formatted fasta file:

bad_header <- c(
  "*sequence_A",
  "GGTAAGTCCTCTAGTACAAACACCCCCAAT",
  ">sequence_B",
  "ATTGTGATATAATTAAAATTATATTCATAT"
)

Note that the first header starts with * instead of a >. Upgrading the Fasta() parser with the reporter() function to an error reporting parser yields a basic error message:

reporter(Fasta())(bad_header)

#> Error : Parser failed on line 1 of input.
#>   1 | >> *sequence_A
#>   2 |    GGTAAGTCCTCTAGTACAAACACCCCCAAT
#>   3 |    >sequence_B
#>   4 |    ATTGTGATATAATTAAAATTATATTCATAT

We could, however, get better error messaging by upgrading the Header() parser to a named parser:

Header <- function() {
  named(
    match_s(stringparser("^>(\\w+)")) %using% 
      function(x) list(title = unlist(x)),
    "FASTA header (>sequence_name)"
  )
}

where the first argument to the named() function is a parser body and the second argument is a brief description of the parser. Now, the reporter yields a more detailed message:

reporter(Fasta())(bad_header)

#> Error : Parser failed on line 1 of input.
#> Expected: FASTA header (>sequence_name)
#>   1 | >> *sequence_A
#>   2 |    GGTAAGTCCTCTAGTACAAACACCCCCAAT
#>   3 |    >sequence_B
#>   4 |    ATTGTGATATAATTAAAATTATATTCATAT

Suppose we have the following bad fasta file, in which the sequence lines for sequence_A are missing:

missing_sequence <- c(
  ">sequence_A",
  ">sequence_B",
  "ATTGTGATATAATTAAAATTATATTCATAT"
)

Upgrading NuclSequence() and ProtSequence() to named parsers yields a better error message:

NuclSequence <- function() {
  named(
    one_or_more(NuclSequenceString()) %using% 
      function(x) list(type = "Nucl", sequence = paste(x, collapse="")),
    "Nucleotide_Sequence"
  )
}

ProtSequence <- function() {
  named(
    one_or_more(ProtSequenceString()) %using% 
      function(x) list(type = "Prot", sequence = paste(x, collapse="")),
    "Protein_Sequence"
  )
}

reporter(Fasta())(missing_sequence)

#> Error : Parser failed on line 2 of input.
#> Expected one of: Nucleotide_Sequence, Protein_Sequence
#>   1 |    >sequence_A
#>   2 | >> >sequence_B
#>   3 |    ATTGTGATATAATTAAAATTATATTCATAT
