Acornima 1.2.0
Acorn + Esprima = Acornima
This project is an interbreeding of the acornjs and the Esprima.NET parsers, with the intention of creating an even more complete and performant ECMAScript (a.k.a. JavaScript) parser library for .NET by combining the best bits of both.
It should also be mentioned that there is an earlier .NET port of acornjs, AcornSharp, which, although it has been unmaintained for a long time, served as a good starting point. Had it not been for AcornSharp, this project would probably never have started.
Here is what this Frankenstein's monster looks like:
- The tokenizer is mostly a direct translation of the acornjs tokenizer to C# (with many bigger and smaller performance improvements, partly inspired by Esprima.NET) - apart from the regex validation/conversion logic, which has been borrowed from Esprima.NET.
- The parser is ~99% acornjs (also with a bunch of minor improvements) and ~1% Esprima.NET (strict mode detection, public API). It is also worth mentioning that the error reporting has been changed to use the error messages of V8.
- It includes protection against the non-catchable StackOverflowException using the same approach as Roslyn.
- Both parent projects follow the ESTree specification, and so does Acornima. The actual AST implementation is based on that of Esprima.NET, with further minor improvements to the class hierarchy that bring it even closer to the spec and allow encoding a bit more information.
- The built-in AST visitors and additional utility features stem from Esprima.NET as well.
And what good comes out of this mix?
- A parser which already matches the performance of Esprima.NET, while doing more: it also passes the complete Test262 test suite for ECMAScript 2023.
- It is also more economical with regard to stack usage, so it can parse ~2x deeper structures.
- More options for fine-tuning parsing.
- A standalone tokenizer which can deal with most of the ambiguities of the JavaScript grammar (thanks to the clever context tracking solution implemented by acornjs).
- The parser tracks variable scopes to detect variable redeclarations. As of v1.1.0, it's able to expose the collected scope information to the consumer (see also this PR or this other example of usage).
Getting started
1. Install the package from NuGet
dotnet add package Acornima
Or, if you want to use additional features like JSX parsing, JavaScript generation from AST or AST to JSON conversion:
dotnet add package Acornima.Extras
2. Import the Acornima namespace in your application
using Acornima;
3. Create a parser instance
var parser = new Parser();
Or, if you want to tweak the available settings:
var parser = new Parser(new ParserOptions { /* ... */ });
4. Use the parser instance to parse your JavaScript code
var ast = parser.ParseScript("console.log('Hello world!')");
AST
Node [x]
├─AssignmentPattern : IDestructuringPatternElement [v,s]
├─CatchClause [v,s]
├─ClassBody [v,s]
├─ClassProperty : IClassElement, IProperty
│ ├─AccessorProperty : IClassElement, IProperty [v,s]
│ ├─MethodDefinition : IClassElement, IProperty [v,s]
│ └─PropertyDefinition : IClassElement, IProperty [v,s]
├─Decorator [v,s]
├─DestructuringPattern : IDestructuringPatternElement
│ ├─ArrayPattern : IDestructuringPatternElement [v,s]
│ └─ObjectPattern : IDestructuringPatternElement [v,s]
├─ImportAttribute [v,s]
├─ModuleSpecifier
│ ├─ExportSpecifier [v,s]
│ └─ImportDeclarationSpecifier
│   ├─ImportDefaultSpecifier [v,s]
│   ├─ImportNamespaceSpecifier [v,s]
│   └─ImportSpecifier [v,s]
├─Program : IHoistingScope [v]
│ ├─Module : IHoistingScope [s,t=Program]
│ └─Script : IHoistingScope [s,t=Program]
├─Property : IProperty
│ ├─AssignmentProperty : IProperty [v,s,t=Property]
│ └─ObjectProperty : IProperty [v,s,t=Property]
├─RestElement : IDestructuringPatternElement [v,s]
├─StatementOrExpression
│ ├─Expression [x]
│ │ ├─ArrayExpression [v,s]
│ │ ├─ArrowFunctionExpression : IFunction [v,s]
│ │ ├─AssignmentExpression [v,s]
│ │ ├─AwaitExpression [v,s]
│ │ ├─BinaryExpression [v]
│ │ │ ├─LogicalExpression [s]
│ │ │ └─NonLogicalBinaryExpression [s,t=BinaryExpression]
│ │ ├─CallExpression : IChainElement [v,s]
│ │ ├─ChainExpression [v,s]
│ │ ├─ClassExpression : IClass [v,s]
│ │ ├─ConditionalExpression [v,s]
│ │ ├─FunctionExpression : IFunction [v,s]
│ │ ├─Identifier : IDestructuringPatternElement [v,s]
│ │ ├─ImportExpression [v,s]
│ │ ├─Literal [v]
│ │ │ ├─BigIntLiteral [s,t=Literal]
│ │ │ ├─BooleanLiteral [s,t=Literal]
│ │ │ ├─NullLiteral [s,t=Literal]
│ │ │ ├─NumericLiteral [s,t=Literal]
│ │ │ ├─RegExpLiteral [s,t=Literal]
│ │ │ └─StringLiteral [s,t=Literal]
│ │ ├─MemberExpression : IChainElement, IDestructuringPatternElement [v,s]
│ │ ├─MetaProperty [v,s]
│ │ ├─NewExpression [v,s]
│ │ ├─ObjectExpression [v,s]
│ │ ├─ParenthesizedExpression [v,s]
│ │ ├─PrivateIdentifier [v,s]
│ │ ├─SequenceExpression [v,s]
│ │ ├─SpreadElement [v,s]
│ │ ├─Super [v,s]
│ │ ├─TaggedTemplateExpression [v,s]
│ │ ├─TemplateLiteral [v,s]
│ │ ├─ThisExpression [v,s]
│ │ ├─UnaryExpression [v]
│ │ │ ├─NonUpdateUnaryExpression [s,t=UnaryExpression]
│ │ │ └─UpdateExpression [s]
│ │ └─YieldExpression [v,s]
│ └─Statement [x]
│   ├─BlockStatement [v]
│   │ ├─FunctionBody : IHoistingScope [v,s,t=BlockStatement]
│   │ ├─NestedBlockStatement [s,t=BlockStatement]
│   │ └─StaticBlock : IClassElement, IHoistingScope [v,s]
│   ├─BreakStatement [v,s]
│   ├─ContinueStatement [v,s]
│   ├─DebuggerStatement [v,s]
│   ├─Declaration [x]
│   │ ├─ClassDeclaration : IClass [v,s]
│   │ ├─FunctionDeclaration : IFunction [v,s]
│   │ ├─ImportOrExportDeclaration
│   │ │ ├─ExportDeclaration
│   │ │ │ ├─ExportAllDeclaration [v,s]
│   │ │ │ ├─ExportDefaultDeclaration [v,s]
│   │ │ │ └─ExportNamedDeclaration [v,s]
│   │ │ └─ImportDeclaration [v,s]
│   │ └─VariableDeclaration [v,s]
│   ├─DoWhileStatement [v,s]
│   ├─EmptyStatement [v,s]
│   ├─ExpressionStatement [v]
│   │ ├─Directive [s,t=ExpressionStatement]
│   │ └─NonSpecialExpressionStatement [s,t=ExpressionStatement]
│   ├─ForInStatement [v,s]
│   ├─ForOfStatement [v,s]
│   ├─ForStatement [v,s]
│   ├─IfStatement [v,s]
│   ├─LabeledStatement [v,s]
│   ├─ReturnStatement [v,s]
│   ├─SwitchStatement [v,s]
│   ├─ThrowStatement [v,s]
│   ├─TryStatement [v,s]
│   ├─WhileStatement [v,s]
│   └─WithStatement [v,s]
├─SwitchCase [v,s]
├─TemplateElement [v,s]
└─VariableDeclarator [v,s]
Legend:
- v - A visitation method is generated in the visitors for the node type.
- s - The node class is sealed. (It's beneficial to check for sealed types when possible.)
- t - The node type (the value of the Node.Type property) as specified by ESTree (shown only if it differs from the name of the node class).
- x - The node class can be subclassed. (The AST provides some limited extensibility for special use cases.)
JSX
The library also supports the syntax extension JSX.
However, mostly for performance reasons, the related functionality is separated from the core parser: it is available in the Acornima.Extras package, in the Acornima.Jsx namespace.
Installation & usage
After installing the Acornima.Extras package as described in the Getting started section, you can parse JSX code like this:
using Acornima.Jsx;
var parser = new JsxParser(new JsxParserOptions { /* ... */ });
var ast = parser.ParseScript("<>Hello world!</>");
AST
Node [x]
└─StatementOrExpression
  └─Expression [x]
    └─JsxNode [x]
      ├─JsxAttributeLike
      │ ├─JsxAttribute [v,s]
      │ └─JsxSpreadAttribute [v,s]
      ├─JsxClosingTag
      │ ├─JsxClosingElement [v,s]
      │ └─JsxClosingFragment [v,s]
      ├─JsxElementOrFragment
      │ ├─JsxElement [v,s]
      │ └─JsxFragment [v,s]
      ├─JsxEmptyExpression [v,s]
      ├─JsxExpressionContainer [v,s]
      ├─JsxName
      │ ├─JsxIdentifier [v,s]
      │ ├─JsxMemberExpression [v,s]
      │ └─JsxNamespacedName [v,s]
      ├─JsxOpeningTag
      │ ├─JsxOpeningElement [v,s]
      │ └─JsxOpeningFragment [v,s]
      └─JsxText [v,s]
Migration from Esprima.NET
Projects using Esprima.NET can be converted to Acornima relatively easily, as the public APIs of the two libraries are very similar. (A pretty good proof of this statement is this PR, which migrates Jint to Acornima.)
The most notable changes to keep in mind with regard to migration are the following:
- The default value of the ParserOptions.RegExpParseMode property has been changed to RegExpParseMode.Validate.
- The default value of the ParserOptions.RegexTimeout property has been changed to 5 seconds.
- The default value of the ParserOptions.Tolerant property has been changed to false.
- The Location struct has been renamed to SourceLocation.
- The TokenType and CommentType enums have been renamed to TokenKind and CommentKind, respectively. Also, some of the member names have been changed.
- The Token and Comment structs have been completely reworked. The SyntaxToken and SyntaxComment classes have been removed.
- The SyntaxElement class has been removed, that is, the Node class has become the root of the AST node type hierarchy. (This also means that tokens and comments are not attached to the root nodes of the AST. You can obtain those via the ParserOptions.OnToken and ParserOptions.OnComment callbacks.)
- The Nodes enum has been renamed to NodeType.
- The Node.AssociatedData property has been renamed to UserData.
- The AssignmentOperator, BinaryOperator and UnaryOperator enums have been merged into a single enum named Operator. Also, some of the member names have been changed.
- The Literal node class has been changed to only provide an object? Value { get; } property for accessing the literal value. There are sealed subclasses for the different kinds of literals. Use those to access literal values in a type-safe (and more efficient) manner.
- The Property node class has been made abstract and two sealed subclasses have been introduced: AssignmentProperty and ObjectProperty (for representing properties of object destructuring patterns and object literals, respectively). Also, the VisitProperty method has been replaced with VisitAssignmentProperty and VisitObjectProperty in visitors.
- Similar changes have been made to the BlockStatement node class. Two new sealed subclasses have been introduced: FunctionBody and NestedBlockStatement (for representing bodies of function expressions/declarations and actual block statements that occur within function bodies, respectively). Also, to conform to the ESTree spec, StaticBlock has been changed to be a subclass of BlockStatement. The VisitBlockStatement method has been kept in visitors, but only NestedBlockStatement is dispatched to it. The other two subclasses have dedicated visitation methods (VisitFunctionBody and VisitStaticBlock).
- The ClassElement node base class has been replaced with the IClassElement interface.
- The Strict property of function expression/declaration node classes has been moved to FunctionBody.
- The JsxExpression node class has been renamed to JsxNode.
- The JsxElement node class has been renamed to JsxElementOrFragment and two sealed subclasses have been introduced: JsxElement and JsxFragment.
- The ParserException class has been renamed to ParseErrorException and has been made abstract. Two concrete subclasses (SyntaxErrorException and RegExpConversionError) have been introduced to indicate the different kinds of errors that can occur during parsing.
- The message format of the ParseErrorException class has been changed. The reported messages are translatable text resources, so it is not recommended to rely on them to determine the reason for the error. For such purposes, you can use the ParseErrorException.Error.Code property.
- The ParseErrorException.Column property has been changed to store zero-based indices. (The exception message still includes one-based column indices though.)
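The error-related items above can be illustrated with a short sketch. It only uses members named in this document (Parser.ParseScript, ParseErrorException, Error.Code, Column); the invalid input string is made up for the example, and the exact code and message values are not shown because they depend on the library version.

```csharp
using System;
using Acornima;

var parser = new Parser();

try
{
    // Deliberately invalid input: the closing parenthesis is missing.
    parser.ParseScript("console.log('Hello world!'");
}
catch (ParseErrorException ex)
{
    // Error.Code identifies the kind of error independently of the
    // translatable message text, so branch on it rather than on Message.
    Console.WriteLine($"Code:    {ex.Error.Code}");

    // Column is zero-based, while the message text reports a one-based column.
    Console.WriteLine($"Column:  {ex.Column}");
    Console.WriteLine($"Message: {ex.Message}");
}
```

Catching the abstract ParseErrorException base class covers both concrete error kinds; catch a subclass instead if only one of them should be handled.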
Benchmarks
| Method | Runtime | FileName | Mean | Allocated |
|---|---|---|---|---|
| Acornima v1.0.0 | .NET 8.0 | angular-1.2.5 | 10.679 ms | 3978.22 KB |
| Acornima v1.0.0 | .NET Framework 4.8 | angular-1.2.5 | 22.905 ms | 3999.01 KB |
| Esprima v3.0.5 | .NET 8.0 | angular-1.2.5 | 11.443 ms | 3828.11 KB |
| Esprima v3.0.5 | .NET Framework 4.8 | angular-1.2.5 | 20.483 ms | 3879.53 KB |
| Acornima v1.0.0 | .NET 8.0 | backbone-1.1.0 | 1.428 ms | 629.26 KB |
| Acornima v1.0.0 | .NET Framework 4.8 | backbone-1.1.0 | 3.218 ms | 633.09 KB |
| Esprima v3.0.5 | .NET 8.0 | backbone-1.1.0 | 1.440 ms | 613.88 KB |
| Esprima v3.0.5 | .NET Framework 4.8 | backbone-1.1.0 | 2.903 ms | 620.3 KB |
| Acornima v1.0.0 | .NET 8.0 | jquery-1.9.1 | 8.066 ms | 3271.63 KB |
| Acornima v1.0.0 | .NET Framework 4.8 | jquery-1.9.1 | 18.210 ms | 3288.41 KB |
| Esprima v3.0.5 | .NET 8.0 | jquery-1.9.1 | 8.391 ms | 3305.23 KB |
| Esprima v3.0.5 | .NET Framework 4.8 | jquery-1.9.1 | 16.456 ms | 3355.15 KB |
| Acornima v1.0.0 | .NET 8.0 | jquery.mobile-1.4.2 | 14.253 ms | 5449.24 KB |
| Acornima v1.0.0 | .NET Framework 4.8 | jquery.mobile-1.4.2 | 29.750 ms | 5480.16 KB |
| Esprima v3.0.5 | .NET 8.0 | jquery.mobile-1.4.2 | 14.566 ms | 5428.48 KB |
| Esprima v3.0.5 | .NET Framework 4.8 | jquery.mobile-1.4.2 | 27.084 ms | 5497.48 KB |
| Acornima v1.0.0 | .NET 8.0 | mootools-1.4.5 | 6.735 ms | 2755.9 KB |
| Acornima v1.0.0 | .NET Framework 4.8 | mootools-1.4.5 | 14.818 ms | 2771.45 KB |
| Esprima v3.0.5 | .NET 8.0 | mootools-1.4.5 | 6.877 ms | 2777.83 KB |
| Esprima v3.0.5 | .NET Framework 4.8 | mootools-1.4.5 | 13.740 ms | 2816.33 KB |
| Acornima v1.0.0 | .NET 8.0 | underscore-1.5.2 | 1.214 ms | 529.61 KB |
| Acornima v1.0.0 | .NET Framework 4.8 | underscore-1.5.2 | 2.775 ms | 532.29 KB |
| Esprima v3.0.5 | .NET 8.0 | underscore-1.5.2 | 1.235 ms | 539.42 KB |
| Esprima v3.0.5 | .NET Framework 4.8 | underscore-1.5.2 | 2.501 ms | 547.18 KB |
| Acornima v1.0.0 | .NET 8.0 | yui-3.12.0 | 6.408 ms | 2611.82 KB |
| Acornima v1.0.0 | .NET Framework 4.8 | yui-3.12.0 | 13.831 ms | 2628.61 KB |
| Esprima v3.0.5 | .NET 8.0 | yui-3.12.0 | 6.667 ms | 2585.78 KB |
| Esprima v3.0.5 | .NET Framework 4.8 | yui-3.12.0 | 12.636 ms | 2624.92 KB |
Known issues and limitations
Regular expressions
The parser can be configured to convert JS regular expression literals to .NET Regex instances (see ParserOptions.RegExpParseMode).
However, because of the fundamental differences between the JS and .NET regex engines, in many cases this conversion can't be done perfectly (or, in some cases, at all):
- Case-insensitive matching won't always yield the same results. Implementing a workaround for this issue would be extremely hard, if not impossible.
- The JS regex engine assigns numbers to capturing groups sequentially (regardless of the group being named or not named) but .NET uses a different, weird approach: "Captures that use parentheses are numbered automatically from left to right based on the order of the opening parentheses in the regular expression, starting from 1. However, named capture groups are always ordered last, after non-named capture groups." Without some adjustments, this would totally mess up numbered backreferences and replace pattern references. So, as a workaround, the converter wraps all named capturing groups in a non-named capturing group to force .NET to include all the original capturing groups in the resulting match in the expected order. (Of course, this won't prevent named groups from being listed after the numbered ones.) If needed, the original number of groups can be obtained from the returned RegExpParseResult object's ActualRegexGroupCount property.
- The characters allowed in group names differ in the two regex engines. For example, the group name $group is valid in JS but invalid in .NET. So, as a workaround, the converter encodes the problematic group names to names that are valid in .NET and probably won't collide with other group names present in the pattern. For example, $group is encoded as __utf8_2467726F7570. The original group names can be obtained using the returned RegExpParseResult object's GetRegexGroupName method.
- Self-referencing capturing groups like /((a+)(\1) ?)+/ may not produce the exact same captures. RegexOptions.ECMAScript is supposed to cover this, however even the MSDN example doesn't produce the same matches. (As a side note, RegexOptions.ECMAScript is kind of a false promise: it can't even get some basic cases right by itself.)
- Similarly, repeated nested groups like /((a+)?(b+)?(c))*/ may produce different captures for the groups. (JS has an overwrite behavior while .NET doesn't.)
- .NET treats forward references like \1(\w) differently than JS and it's not possible to convert this kind of pattern reliably. (The converter could make some patterns work by rewriting them to something like (?:)(\w) but there are cases where even this wouldn't work.)
- Unicode mode issues:
  - There could be false positive empty string matches in the middle of surrogate pairs. Patterns as simple as /a?/u will cause this issue when the input string contains astral Unicode chars. There is no out-of-the-box workaround for this issue but it can be mitigated somewhat using a bit of "post-processing", i.e., by filtering out the false positive matches after evaluation like it's done here. Probably there is no way to improve this situation until .NET adds the option to treat the input string as Unicode code points.
  - Support for Unicode property escapes is pretty limited (see explanation). Currently, only General Category expressions are converted. But even this is not perfect as the result will depend on the Unicode version included in the specific .NET runtime which is executing the parser's code.
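The group numbering difference described in the list above can be observed with the plain .NET Regex class, without involving the parser at all. The following is a minimal sketch: the pattern and input are made up for illustration, and the wrapped pattern is written by hand to mimic the described workaround, so it is not necessarily identical to the converter's actual output.

```csharp
using System;
using System.Text.RegularExpressions;

const string input = "2024-05";

// In JS, /(?<year>\d{4})-(\d{2})/ numbers its groups left to right:
// group 1 = "2024", group 2 = "05".
// .NET numbers the unnamed group first and moves the named group to the end.
var direct = Regex.Match(input, @"(?<year>\d{4})-(\d{2})");
Console.WriteLine(direct.Groups[1].Value);       // 05
Console.WriteLine(direct.Groups[2].Value);       // 2024 (the named group "year")

// Wrapping the named group in an unnamed group restores the JS numbering
// for groups 1 and 2. The named group itself is still listed last.
var wrapped = Regex.Match(input, @"((?<year>\d{4}))-(\d{2})");
Console.WriteLine(wrapped.Groups[1].Value);      // 2024
Console.WriteLine(wrapped.Groups[2].Value);      // 05
Console.WriteLine(wrapped.Groups["year"].Value); // 2024
Console.WriteLine(wrapped.Groups.Count);         // 4 (group 0 plus three capturing groups)
```

The extra wrapper group is why the group count of the converted pattern can exceed the number of groups in the original JS pattern.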
To sum it up, legacy pattern conversion is pretty solid apart from the minor issues listed above. However, support for unicode mode (flag u) patterns is partial and quirky, while conversion of the upcoming unicode sets mode (flag v) will be next to impossible - until the .NET regex engine gets some improved Unicode support.
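The false positive empty matches mentioned under the Unicode mode issues can also be reproduced with plain .NET. The sketch below is a simplified illustration of the post-processing idea, not the implementation the text links to: SplitsSurrogatePair is a hypothetical helper introduced here, and only empty matches are filtered.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// U+1F600 is stored as a surrogate pair, i.e. two UTF-16 code units.
const string input = "\U0001F600";

// JS /a?/gu matches the empty string at indices 0 and 2 only.
// .NET works on UTF-16 code units, so it also reports index 1,
// which lies between the two halves of the surrogate pair.
var matches = Regex.Matches(input, "a?").Cast<Match>().ToList();
Console.WriteLine(string.Join(",", matches.Select(m => m.Index)));  // 0,1,2

// Hypothetical helper: true when the index falls inside a surrogate pair.
static bool SplitsSurrogatePair(string s, int index) =>
    index > 0 && index < s.Length
    && char.IsHighSurrogate(s[index - 1]) && char.IsLowSurrogate(s[index]);

// Post-processing: drop empty matches that start inside a surrogate pair.
var filtered = matches
    .Where(m => !(m.Length == 0 && SplitsSurrogatePair(input, m.Index)))
    .ToList();
Console.WriteLine(string.Join(",", filtered.Select(m => m.Index))); // 0,2
```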
Any feedback appreciated, contributions are welcome!
.NET Framework 4.6.2
- System.Memory (>= 4.5.5)
- System.Runtime.CompilerServices.Unsafe (>= 6.0.0)
.NET 8.0
- No dependencies.
.NET Standard 2.0
- System.Memory (>= 4.5.5)
- System.Runtime.CompilerServices.Unsafe (>= 6.0.0)
.NET Standard 2.1
- System.Runtime.CompilerServices.Unsafe (>= 6.0.0)