This R package contains tools to construct parser combinator functions, higher-order functions that parse input. The main goal of this package is to simplify the creation of transparent parsers for structured text files generated by machines like laboratory instruments. Such files consist of lines of text organized in higher-order structures like headers with metadata and blocks of measured values. To read these data into R you first need to create a parser that processes these files and creates R objects as output. The parcr package simplifies the task of creating such parsers.
This package was inspired by the package “Ramble” by Chapman Siu and co-workers and by the paper “Higher-order functions for parsing” by Graham Hutton (1992).
Install the stable version from CRAN:
install.packages("parcr")
To install the development version, including its vignette, run the following command (it relies on install_github() from the remotes package, which must be installed):
remotes::install_github("SystemsBioinformatics/parcr", build_vignettes = TRUE)
As an example of a realistic application we write a parser for fasta-formatted files for nucleotide and protein sequences. We use a few simplifying assumptions about this format for the sake of the example. Real fasta files are more complex than we pretend here.
Please note that more background about the functions that we use here is available in the package documentation. Here we only present a summary.
A fasta file with mixed sequence types could look like the example below:
>sequence_A
GGTAAGTCCTCTAGTACAAACACCCCCAAT
TCTGTTGCCAGAAAAAACACTTTTAGGCTA
>sequence_B
ATTGTGATATAATTAAAATTATATTCATAT
TATTAGAGCCATCTTCTTTGAAGCGTTGTC
TATGCATCGATC
>sequence_C
MTEITAAMVKELRESTGAGMMDCKNALSET
NGDFDKAVQLLREKGLGKAAKKADRLAAEG
ENEYKALVAELEKE
Since fasta files are text files, we could read such a file into a character vector using readLines(). The package provides the data set fastafile, which contains that character vector.
library(parcr)
data("fastafile")
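As a minimal sketch of the readLines() route mentioned above, the snippet below writes a small fasta file to a temporary location and reads it back. The file contents and the names fasta_text and lines_read are made up for illustration; they are not the fastafile data set.

```r
# Hypothetical example content; not the 'fastafile' data set from the package
fasta_text <- c(
  ">sequence_A",
  "GGTAAGTCCT",
  ">sequence_C",
  "MTEITAAMVK"
)
path <- tempfile(fileext = ".fasta")
writeLines(fasta_text, path)

# readLines() returns one character string per line of the file
lines_read <- readLines(path)
print(lines_read)

unlink(path)  # clean up the temporary file
```

A character vector of this shape, one element per line, is the kind of input the parsers below operate on.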
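We can distinguish the following higher-order components in a fasta file:
- A fasta file consists of one or more sequence blocks until the end of the file.
- A sequence block consists of a header followed by a nucleotide sequence or a protein sequence. A sequence block may be preceded by empty lines.
- A nucleotide sequence consists of one or more nucleotide sequence strings.
- A protein sequence consists of one or more protein sequence strings.
- A header is a string that starts with ">" immediately followed by a title without spaces.
- A nucleotide sequence string is a string with only characters from the set {G,A,T,C}.
- A protein sequence string is a string with only characters from the set {A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}.
It now becomes clear what we mean when we say that the package allows us to write transparent parsers: the description above of the structure of fasta files can be put straight into code for a Fasta() parser: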
Fasta <- function() {
  one_or_more(SequenceBlock()) %then%
    eof()
}
SequenceBlock <- function() {
  MaybeEmpty() %then%
    Header() %then%
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}
NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using%
    function(x) list(type = "Nucl", sequence = paste(x, collapse = ""))
}
ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using%
    function(x) list(type = "Prot", sequence = paste(x, collapse = ""))
}
Functions like one_or_more(), %then%, %or%, %using%, eof() and MaybeEmpty() are defined in the package and are the basic parsers with which the package user can build complex parsers. The %using% operator uses the function on its right-hand side to modify the parser output on its left-hand side. Please see the vignette in the parcr package for more explanation of why this is useful or even necessary.
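To give a rough idea of what an operator like %using% does, here is a toy model in base R. It is only a sketch and not the parcr implementation: the names toy_literal and %toy_using% are hypothetical, and returning NULL to signal failure is an assumption made for this example. Only the L and R element names follow the parser output shown in this document.

```r
# Toy parser: succeeds when the first element of the input equals `s`.
# Returns list(L = parsed part, R = remaining input), or NULL on failure
# (NULL as a failure signal is an assumption of this sketch).
toy_literal <- function(s) {
  function(x) {
    if (length(x) > 0 && identical(x[[1]], s)) {
      list(L = list(s), R = x[-1])
    } else {
      NULL
    }
  }
}

# Toy "using" combinator: apply f to the parsed part, leave the rest alone
`%toy_using%` <- function(p, f) {
  function(x) {
    r <- p(x)
    if (is.null(r)) return(NULL)
    list(L = f(r$L), R = r$R)
  }
}

shout <- toy_literal("abc") %toy_using% function(l) lapply(l, toupper)
res <- shout(c("abc", "def"))
str(res)
```

In this toy model the function on the right only ever sees the parsed part, so the remaining input is passed on unchanged to whatever parser comes next.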
Notice that the new parser functions that we define above are higher-order functions taking no input, hence the empty argument brackets () behind their names. Now we need to define the line-parsers Header(), NuclSequenceString() and ProtSequenceString() that recognize and process the header line and single lines of nucleotide or protein sequences in the character vector fastafile. We use functions from stringr to do this in a few helper functions, and we use match_s() to create parcr parsers from these.
# returns the title after the ">" in the sequence header
parse_header <- function(line) {
  # Study stringr::str_match() to understand what we do here
  m <- stringr::str_match(line, "^>(\\w+)")
  if (is.na(m[1])) {
    return(list()) # signal failure: no title found
  } else {
    return(m[2])
  }
}
# returns a nucleotide sequence string
parse_nucl_sequence_line <- function(line) {
  # The line must consist of GATC from the start (^) until the end ($)
  m <- stringr::str_match(line, "^([GATC]+)$")
  if (is.na(m[1])) {
    return(list()) # signal failure: not a valid nucleotide sequence string
  } else {
    return(m[2])
  }
}
# returns a protein sequence string
parse_prot_sequence_line <- function(line) {
  # The line must consist of ARNDBCEQZGHILKMFPSTWYV from the start (^) until the
  # end ($)
  m <- stringr::str_match(line, "^([ARNDBCEQZGHILKMFPSTWYV]+)$")
  if (is.na(m[1])) {
    return(list()) # signal failure: not a valid protein sequence string
  } else {
    return(m[2])
  }
}
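The helper functions above rely on the shape of the value returned by stringr::str_match(). The short example below, which assumes stringr is installed, shows that value for a matching and a non-matching line; the names hit and miss are only illustrative.

```r
library(stringr)  # assumes the stringr package is installed

# Matching header line: column 1 holds the full match, column 2 the
# captured group (the title)
hit <- str_match(">sequence_A", "^>(\\w+)")

# Non-matching line: every column is NA
miss <- str_match("GGTAAGTCC", "^>(\\w+)")

print(hit)
print(is.na(miss[1]))
```

This is why the helpers test is.na(m[1]) to detect failure and return m[2] on success.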
Then we define the line-parsers.
Header <- function() {
  match_s(parse_header) %using%
    function(x) list(title = unlist(x))
}
NuclSequenceString <- function() {
  match_s(parse_nucl_sequence_line)
}
ProtSequenceString <- function() {
  match_s(parse_prot_sequence_line)
}
where match_s() is also a parser defined in parcr.
Now we have all the elements that we need to apply the Fasta() parser.
Fasta()(fastafile)
#> $L
#> $L[[1]]
#> $L[[1]]$title
#> [1] "sequence_A"
#>
#> $L[[1]]$type
#> [1] "Nucl"
#>
#> $L[[1]]$sequence
#> [1] "GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA"
#>
#>
#> $L[[2]]
#> $L[[2]]$title
#> [1] "sequence_B"
#>
#> $L[[2]]$type
#> [1] "Nucl"
#>
#> $L[[2]]$sequence
#> [1] "ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC"
#>
#>
#> $L[[3]]
#> $L[[3]]$title
#> [1] "sequence_C"
#>
#> $L[[3]]$type
#> [1] "Prot"
#>
#> $L[[3]]$sequence
#> [1] "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE"
#>
#>
#>
#> $R
#> list()
The output of the parser consists of two elements, L and R, where L contains the parsed and processed part of the input and R the remaining unparsed part of the input. Since we explicitly demanded parsing until the end of the file with the eof() function in the definition of the Fasta() parser, the R element contains an empty list to signal that the parser was indeed at the end of the input.
Please see the package documentation for more examples and explanation.
Finally, let’s present the result of the parse more concisely using the names of the elements inside the L element:
d <- Fasta()(fastafile)[["L"]]
invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")}))
#> Nucl sequence_A GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA
#> Nucl sequence_B ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC
#> Prot sequence_C MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE
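If a tabular form is more convenient, a list shaped like the parse result can be turned into a data frame with base R. The sketch below uses a hand-made stand-in for the parse result, with shortened, made-up sequences, so that it runs without the package; the names df and length are illustrative choices.

```r
# Stand-in for the parse result, with shortened, made-up sequences.
# It mirrors the title/type/sequence structure shown in the output above.
d <- list(
  list(title = "sequence_A", type = "Nucl", sequence = "GGTAAG"),
  list(title = "sequence_C", type = "Prot", sequence = "MTEITA")
)

# Build one data frame row per sequence block and stack the rows
df <- do.call(rbind, lapply(d, function(x) {
  data.frame(title = x$title, type = x$type, sequence = x$sequence,
             stringsAsFactors = FALSE)
}))
df$length <- nchar(df$sequence)  # sequence length in characters
print(df)
```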