Pick a markdown and code style:
Abstract syntax trees are usually defined recursively. This normally means some form of smart pointer is requried to implement them. The problem with this aspect of Rust is that the smart pointers block nested patterns from matching against AST expressions. It is possible to define recursive structures using references (borrows) but all references in an AST structure must have compatible lifetimes. An "arena" structure is required to hold the values that they reference. While it's possible to define such an arena manually, rustlr provides support for using the bumpalo crate.
A grammar can begin with the declaration
auto-bump
in place of auto
, which enables the generation of bump-allocated
ASTs. It is also necessary to cargo add bumpalo
or place
bumpalo = "3"
in your crate dependencies. in Cargo.toml. Although
user code do not need to reference the crate directly, the generated
parser code does.
The disadvantage of bumpalo is that it bypasses some of the memory safety checks of Rust. Bump-allocation is not recommended if frequent changes are made to the allocated structures. They are appropriate if the AST remains relatively stable once created, with few changes if any, until the entire structure can be dropped.
The advantage of bump-allocation, besides an increase in speed, is that it enables the matching of nested patterns against recursive types. Consider the following recursive type that tries to avoid smart pointers:
enum Expr<'t> {
Var(&'t str),
Negative(&'t Expr<'t>),
Plus(&'t Expr<'t>, &'t Expr<'t>),
Minus(&'t Expr<'t>, &'t Expr<'t>),
}
and a function that "pretty-prints" such expressions. The function will print
x
instead of --x
, x-y
instead of x+-y
, and x+y
instead of x--y
.
It also pushes negation inside plus/minus, and only prints parentheses as
required by the non-associative minus symbol.
fn pprint<'t>(expr:&'t Expr<'t>,) {
use crate::Expr::*;
match expr {
Negative(Negative(n)) => pprint(n),
Negative(Plus(a,b)) => pprint(&Minus(&Negative(a),b)),
Negative(Minus(a,b)) => pprint(&Plus(&Negative(a),b)),
Negative(n) => {print!("-"); pprint(n)},
Plus(a,Negative(b)) => pprint(&Minus(a,b)),
Minus(a,Negative(b)) => pprint(&Plus(a,b)),
Minus(a,p@Minus(_,_)) => { pprint(a); print!("-("); pprint(p); print!(")")},
Minus(a,p@Plus(_,_)) => { pprint(a); print!("+("); pprint(p); print!(")")},
Plus(a,b) => {pprint(a); print!("+"); pprint(b);},
Minus(a,b) => {pprint(a); print!("-"); pprint(b)},
Var(x) => print!("{}",x),
}//match expr
}//pprint
Then given
let q = Plus(&Negative(&Negative(&Var("x"))),&Negative(&Var("y")));
pprint(&q);
will print x-y
. Such a "declarative" style of programming is not
possible if the type is defined using Box (or LBox).
Notice that the references to new structures created by the function are
passed recursively "up-stack", and so are safe.
However, allocating such structures temporarily on the stack is impractical.
We would not be able to return references to them.
You may also get the dreaded compiler error "creates a temporary which is freed while still in use" when writing expressions such as Plus(&q,&Negative(&q))
.
The bumpalo crate offers an "arena", a kind of simulated heap, that allow us to access these structures within the lifetime of the arena:
let bump = bumpalo::Bump::new(); // creates new bump arena
let p = bump.alloc(Negative(bump.alloc(Negative(bump.alloc(Var("z"))))));
The Bump::alloc function returns a reference to the allocated memory within the arena. We can pass a reference to the "bump" to a function, which would then be able to construct and return bump-allocated structures using references of the same lifetime.
Rustlr can generate such structures automatically.
Most of bumpalo's usage is well-encapsulated, although
bumpalo = "3"
does need to be added to the crate's
dependencies. The easiest way to enable bumpalo is with
the Lexsource structure. The following code fragment
demonstrates how to invoke a parser generated from an auto-bump
grammar,
bautocalc.grammar,
let srcfile = "test1.txt"; // srcfile names file to parse
let source=LexSource::with_bump(srcfile).unwrap();
let mut scanner = bautocalcparser::bautocalclexer::from_source(&source);
let mut parser = bautocalcparser::make_parser(scanner);
let result = bautocalcparser::parse_with(&mut parser);
A Rustlr Lexsource object containing a Bump arena is created with Lexsource::with_bump.
The parse_with
function, which is generated for individual parsers, will place
a reference to the bump arena inside the parser. The automatically
generated semantic actions will call Bump::alloc to create ASTS that will have the
same lifetime as the Lexsource structure.
Although the LBox is no longer needed, a device to capture the
lexical position (line/column) information in the AST in a
non-intrusive manner is still re required.
The struct LC replaces LBox when using bumpalo.
This is just a tuple with open fields
for the value, and an inner tuple with line and column numbers. We've
implemented The Deref/DerefMut traits so the value can be exposed in
the manner of a smart pointer, but there's in fact no pointer
underneath. Inside a grammar, right-hand side labels such as [a]
will automatically bind a
to an LC struct and generate AST types
using LC. For example,
Expr:Div --> Expr:[e1] / Expr:[e2]
will generate an Expr<'lt>
enum with a variant
Div{e1:&'lt LC<Expr<'lt>>,e2:&'lt LC<Expr<'lt>>},
as well as semantic actions that inserts line/column information into the AST as LC enclosures.
A lifetime argument is required for all recursive types.
Please note that, when it comes to vectors and other heap-allocated data structures, we chose not to use bumpalo's vectors. Vectors that are part of the AST are still allocated on the heap, though they may contain references to bump-allocated structures.
There is currently one minor limitation with the auto-bump
option. When
a struct is recursive, it may not be possible to generate code that
#[derive(Default)]
, which is required for all types in Rustlr, without some
help from the user. A reference type &A
does not have a default and so
a type A
that contains a reference to itself also does not have a default. In
such cases, help from the user is required to implement the trait for
these reference types. For example,
A1 --> B1 int
B1 --> var var A1
flatten B1
would generate the types
#[derive(Default,Debug)]
pub struct A1<'lt>(pub &'lt str,pub &'lt str,pub &'lt A1<'lt>,pub i64,);
#[derive(Default,Debug)]
pub struct B1<'lt>(pub &'lt str,pub &'lt str,pub &'lt A1<'lt>,);
but will fail to compile. The following warning will be given by rustlr:
WARNING: Recursive structs may require the manual implementation of the Default trait for reference types, as in
impl<'lt> Default for &'lt A1<'lt> ...
The warning should be heeded by including the following in the grammar:
$static A1DEFAULT:A1<'static> = A1("","",&A1DEFAULT,0);
$impl<'t> Default for &'t A1<'t> { fn default() -> Self { &A1DEFAULT } }
Lines that begin with $
are injected into the _ast.rs
file created from the
grammar. These definitions will allow the #[derive(Default)]
tags for the
A1
and B1
types to be effective. Currently, such lines are not generated
automatically because of Rust's restrictions to what static definitions can
reference.
This problem can also be avoided by just using enums instead of
structs in such cases: just assign a left-side label to the sole
production rules for A1
and B1
. Enums always include a _Nothing
variant
so that a default can always be defined.
There is another implemention-related detail that may be of occassional concern.
To integrate the bumpalo option into Rustlr with minimal disturbance to its
other components, the Bumper struct was introduced. A grammar with the auto-bump
option will generate a parser with a Bumper struct as its exstate
field.
This struct contains procedures to access the encapsulated bump arena. One
can still declare and use an arbitrary externtype
for a Bumper will also include
a field of this type, which is accessible as a ref mut via the
Bumper::state procedure.
In most cases, this detail is only required to be understood if the
exstate
is to be used for another purpose, in which case all references
to parser.exstate
should be replaced by parser.exstate.state()
.
In addition to exstate
each parser also contains a separate,
reference-counted shared_state
field of the same type.
The only other situation that requires understanding of this implementation
detail is when writing semantic actions manually that creates
bump-allocated structures. They should be created by calling
parser.exstate.make(...)
. Here, parser.exstate
is of type Bumper. See the
documentation of Bumper for other functions that can be called.
The same applies to parser.shared_state
under auto-bump
.
We conclude this Chapter with a full example. The following grammar
defines the syntax of propositional (sentential) logic. To be self-contained,
a main
function is injected directly into the parser with !
. This time,
instead of using Lexsource::with_bump we have created a
"Bump arena" manually and inserted it into the parser. All structures created
will have the same lifetime as the bump allocator.
A function can also be injected into the generated AST file (logic_ast.rs
)
using lines starting with $
.
The NNF
function stands for Negation Normal Form: it pushes negations
inward by applying the De Morgan laws and by eliminating double negations
until negations only appear before propositional variables.
# Grammar "bumplogic" for propositional logic with bump-allocated ASTs
auto-bump
lifetime 'lt
lexterminal AND *
lexterminal OR +
lexterminal NOT ~
lexterminal IMPLIES ->
lexterminal SEMICOLON ;
terminals ( )
valterminal Prop Alphanumeric
nonterminal FormulaSeq
nonterminal Formula
nonterminal PrimaryFormula : Formula
# error recovery point
resynch SEMICOLON
topsym FormulaSeq
left AND 80
left OR 50
right IMPLIES 30
PrimaryFormula --> ( Formula )
PrimaryFormula --> Prop
PrimaryFormula:Neg --> NOT PrimaryFormula
Formula --> PrimaryFormula
Formula:And --> Formula AND Formula
Formula:Or --> Formula OR Formula
Formula:Implies --> Formula IMPLIES Formula
FormulaSeq --> Formula<SEMICOLON+>
# function to return negation normal form, injected into bumplogic_ast.rs:
$
$pub fn NNF<'t>(form:&'t Formula<'t>, bump:&'t bumpalo::Bump) -> &'t Formula<'t> {
$ use Formula::*;
$ let REF = |x|{bump.alloc(x)}; // for simplified syntax
$ let nnf = |x|{NNF(x,bump)};
$ match form {
$ Neg(Neg(A)) => nnf(A), // the nnf of ~~A is the nnf of A
$ Neg(And(A,B)) => REF(Or(nnf(REF(Neg(A))),nnf(REF(Neg(B))))),
$ Neg(Or(A,B)) => REF(And(nnf(REF(Neg(A))),nnf(REF(Neg(B))))),
$ Neg(Implies(A,B)) => REF(And(nnf(A),nnf(REF(Neg(B))))),
$ And(A,B) => REF(And(nnf(A),nnf(B))),
$ Or(A,B) => REF(Or(nnf(A),nnf(B))),
$ Implies(A,B) => nnf(REF(Or(REF(Neg(A)),B))), // ~A+B
$ _ => form, //default no change to literals
$ }//match
$}
# function injected into bumplogicparser.rs:
!mod bumplogic_ast;
!use std::io::{Write};
!fn main() {
! let bump = bumpalo::Bump::new();
! print!("Enter proposition: "); let r=std::io::stdout().flush();
! let mut input = String::new();
! let res = std::io::stdin().read_line(&mut input);
! let mut lexer1 = bumplogiclexer::from_str(&input);
! let mut parser1 = make_parser(lexer1);
! parser1.exstate.set(&bump); //the exstate is a "Bumper"
! let fseq = parse_with(&mut parser1)
! .unwrap_or_else(|x|{println!("Parsing Errors Encountered"); x});
! if let FormulaSeq(formulas) = fseq {
! for f in &formulas {
! let nnf = NNF(f,parser1.exstate.get());
! println!("NNF for line {}: {:?}",f.line(),nnf);
! }
! }
!}//main
Where the AST structures generated for this grammar are
#[derive(Debug)]
pub enum Formula<'lt> {
And(&'lt Formula<'lt>,&'lt Formula<'lt>),
Or(&'lt Formula<'lt>,&'lt Formula<'lt>),
Implies(&'lt Formula<'lt>,&'lt Formula<'lt>),
Neg(&'lt Formula<'lt>),
Prop(&'lt str),
Formula_Nothing,
}
impl<'lt> Default for Formula<'lt> { fn default()->Self { Formula::Formula_Nothing } }
#[derive(Default,Debug)]
pub struct FormulaSeq<'lt>(pub Vec<&'lt LC<Formula<'lt>>>,);