This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [re2c-devel] interest in a tagged-DFA implementation?


> Hi,
> That looks like a useful feature for some uses. Do you know how much
> overhead it imposes for the generated scanner? I'm also curious how big of
> a change it would be to the re2c code? Would it be better to have more than
> one generator or just an option to do something slightly different?

Hi Dan,

Thank you for the quick reply.

Questions of complexity are discussed in Section 4 of the paper I linked (http://laurikari.net/ville/spire2000-tnfa.pdf).

The simplest way to implement the TDFA algorithm involves keeping a table of (# of tag operations used) x (# of states in the original NFA) position values. Each DFA transition is then marked with a number of assignment operations that modify this table, reordering some existing tag values, and assigning the current position in the string to some others. (A tag operation is an instruction to save the current position in the string, so one is used to mark the start and end of each subexpression.) Thus the overhead is very low for a simple application such as 'just retrieve the entirety of the matched string', but piles up a bit if we use many subexpressions inside complicated constructs, though the algorithm *always* remains linear in the actual length of the input string.

(For this calculation, "# of states in the original NFA" is not the full number of Ins instructions, but something more like the number of states that would exist if a more standard NFA representation was used -- i.e. with fewer implicit e-transitions. The algorithm is not difficult to implement in a way that generates the same results from an NFA no matter how many untagged e-transitions are used to effectively split states into simpler Ins nodes.)

Enabling tags makes re2c's performance somewhat more uneven depending on the regexp, with rare pathological cases that involve rewriting the entire table during one transition, though this is still not as bad as some standard regexp implementations (which have features that can produce nonlinear behaviour in the length of the input).

---

What I think would fit most closely with the re2c philosophy is having an _optional_ YYTAG() primitive of some sort which allows the user to specify how the storage of tag values is implemented. I am not very familiar with the libre2c side of things, since that portion of the codebase was irrelevant for the project I'm doing, but presumably a 'standard' (good enough for most regexps) implementation of YYTAG can then be added there.

The tags have to be kept track of throughout the algorithm, but this does not preclude making them an entirely optional feature along the lines of YYFILL() where returning to the non-YYFILL algorithm is mostly a matter of omitting statements from the generated DFA. Thus code changes would perhaps consist in defining a new TAG opcode for Ins, the addition of fields to the DFA class, addition of tag calculations throughout the algorithm (guarded by if(TFlag) { ... } or such), which doesn't change how the algorithm works if we don't enable tags, but does touch a lot of places in the engine.

All the best,
      Serhei Makarov


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]