Saturday, 29 April 2017

TDParseKit - Cocoa Objective-C Framework for parsing and tokenizing

About

TDParseKit is a Mac OS X Framework written by Todd Ditchendorf in Objective-C 2.0 and released under the MIT Open Source License. The framework is an Objective-C implementation of the tools described in "Building Parsers with Java" by Steven John Metsker. TDParseKit includes some significant additions beyond the designs from the book (many of them hinted at in the book itself) in order to enhance the framework's feature set, usefulness and ease-of-use. Other changes have been made to the designs in the book to match common Cocoa/Objective-C design patterns and conventions. However, these changes are relatively superficial, and Metsker's book is the best documentation available for this framework.

The TDParseKit source code is available from the Subversion repository at the Google Code project page.

Projects using TDParseKit:

  • Spike: A Rails log file viewer/analyzer by Matt Mower
  • JSTalk: Interprocess Cocoa scripting with JavaScript by Gus Mueller
  • Objective-J Port of TDParseKit by Ross Boucher
  • HTTP Client: HTTP debugging/testing tool
  • Fluid: Site-Specific Browser for Mac OS X
  • Cruz: Social Browser for Mac OS X

Xcode Project

The TDParseKit Xcode project consists of 6 targets:

  1. TDParseKit : the TDParseKit Objective-C framework. The central feature/codebase of this project.
  2. Tests : a UnitTest Bundle containing hundreds of unit tests (actually, more correctly, interaction tests) for the framework as well as some example classes that serve as real-world usages of the framework.
  3. DemoApp : A simple Cocoa demo app that gives a visual presentation of the results of tokenizing text using the TDTokenizer class.
  4. DebugApp : A simple Cocoa app that exists only to run arbitrary test code through GDB with breakpoints for debugging (I was not able to do that with the UnitTest bundle).
  5. TDJSParseKit : A JavaScriptCore-based scripting interface to TDParseKit which can be used to expose the entire framework to JavaScript environments.
  6. JSDemoApp: A simple Cocoa application used for exercising the JavaScript interface provided by TDJSParseKit. Note that this is the only target which links to the WebKit framework. Neither TDParseKit nor TDJSParseKit requires WebKit.

TDParseKit Framework

The Objective-C classes in the TDParseKit Framework offer 3 basic services of general use to Cocoa developers:

  1. Tokenization via the TDTokenizer and TDToken classes.
  2. Parsing via a high-level parser-building toolkit consisting of TDParser subclasses.
  3. Declarative parsing via grammars: an Objective-C API for creating language parsers from simple declarative grammars.

Tokenization

The API for tokenization is provided by the TDTokenizer class. Cocoa developers will be familiar with the NSScanner class provided by the Foundation Framework which provides a similar service. However, the TDTokenizer class is simpler and more powerful for many use cases.

Example usage:

NSString *s = @"\"It's 123 blast-off!\", she said, // watch out!\n"
              @"and <= 3.5 'ticks' later /* wince */, it's blast-off!";
TDTokenizer *t = [TDTokenizer tokenizerWithString:s];

TDToken *eof = [TDToken EOFToken];
TDToken *tok = nil;

while ((tok = [t nextToken]) != eof) {
    NSLog(@" (%@)", tok);
}

outputs:

 ("It's 123 blast-off!")
 (,)
 (she)
 (said)
 (,)
 (and)
 (<=)
 (3.5)
 ('ticks')
 (later)
 (,)
 (it's)
 (blast-off)
 (!)

Each token produced is an object of class TDToken. TDTokens have a tokenType (Word, Symbol, Num, QuotedString, etc.) and both a stringValue and a floatValue.
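
As a rough illustration of working with these token properties, here is a minimal sketch that adds up the floatValue of every Num token in a string. The helper name SumOfNumberTokens, the umbrella header import, and the isNumber accessor are assumptions made for this example and are not shown elsewhere in this document; comparing tokenType against the Num type should serve the same purpose if isNumber is unavailable.

```objc
#import <Foundation/Foundation.h>
#import <TDParseKit/TDParseKit.h> // assumed umbrella header name

// Hypothetical helper: sums the floatValue of every Num token in s.
// Word, Symbol and QuotedString tokens are skipped.
static double SumOfNumberTokens(NSString *s) {
    TDTokenizer *t = [TDTokenizer tokenizerWithString:s];
    TDToken *eof = [TDToken EOFToken];
    TDToken *tok = nil;
    double sum = 0.0;

    while ((tok = [t nextToken]) != eof) {
        if (tok.isNumber) { // assumed accessor for the Num token type
            sum += tok.floatValue;
        }
    }
    return sum;
}
```

Called with a string such as @"2 apples and 3.5 pears", only the tokens for 2 and 3.5 would contribute to the sum; the Word tokens are ignored.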

More information about a token can be easily discovered using the -debugDescription method instead of the default -description. Replace the line containing NSLog above with this line:

NSLog(@" (%@)", [tok debugDescription]);

and each token's type will be printed as well:

 <Quoted String «"It's 123 blast-off!"»>
 <Symbol «,»>
 <Word «she»>
 <Word «said»>
 <Symbol «,»>
 <Word «and»>
 <Symbol «<=»>
 <Number «3.5»>
 <Quoted String «'ticks'»>
 <Word «later»>
 <Symbol «,»>
 <Word «it's»>
 <Word «blast-off»>
 <Symbol «!»>

As you can see from the output, TDTokenizer is configured by default to properly group characters into tokens, including:

  • single- and double-quoted string tokens
  • common multiple character symbols (<=)
  • apostrophes, dashes and other symbol chars that should not signal the start of a new Symbol token, but rather be included in the current Word or Num token (it's, blast-off, 3.5)
  • silently ignoring C- and C++-style comments
  • silently ignoring whitespace

The TDTokenizer class is very flexible, and all of those features are configurable. TDTokenizer may be configured to:

  • recognize more (or fewer) multi-char symbols. ex:
    [t.symbolState add:@"!="];

    allows != to be recognized as a single Symbol token rather than two adjacent Symbol tokens

  • add new internal symbol chars to be included in the current Word token, or make internal symbols like apostrophe and dash signal a new Symbol token rather than being part of the current Word token. ex:
    [t.wordState setWordChars:YES from:'_' to:'_'];

    allows Word tokens to contain internal underscores

    [t.wordState setWordChars:NO from:'-' to:'-'];

    disallows Word tokens from containing internal dashes.

  • change which chars signal the start of a token of any given type. ex:
    [t setTokenizerState:t.wordState from:'_' to:'_'];

    allows Word tokens to start with underscore

    [t setTokenizerState:t.quoteState from:'*' to:'*'];

    allows Quoted String tokens to start with an asterisk, effectively making * a new quote symbol (like " or ')

  • turn off recognition of single-line "slash-slash" (//) comments. ex:
    [t setTokenizerState:t.symbolState from:'/' to:'/'];

    slash chars now produce individual Symbol tokens rather than causing the tokenizer to strip text until the next newline char, or to begin stripping for a multiline comment if appropriate (/*)

  • turn on recognition of "hash" (#) single-line comments. ex:
    [t setTokenizerState:t.commentState from:'#' to:'#'];
    [t.commentState addSingleLineStartSymbol:@"#"];
  • turn on recognition of "XML/HTML" (<!-- -->) multi-line comments. ex:
    [t setTokenizerState:t.commentState from:'<' to:'<'];
    [t.commentState addMultiLineStartSymbol:@"<!--" endSymbol:@"-->"];
  • report (rather than silently consume) Comment tokens. ex:
    t.commentState.reportsCommentTokens = YES; // default is NO
  • report (rather than silently consume) Whitespace tokens. ex:
    t.whitespaceState.reportsWhitespaceTokens = YES; // default is NO
  • turn on recognition of any characters (say, digits) as whitespace to be silently ignored. ex:
    [t setTokenizerState:t.whitespaceState from:'0' to:'9'];
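
Several of these options can be combined on a single tokenizer. The following is a minimal sketch, assuming the goal is to tokenize a shell-like script in which identifiers may contain underscores, != is an operator, and # starts a comment. It uses only the configuration calls listed above; the helper name ScriptTokenizerWithString and the umbrella header import are hypothetical.

```objc
#import <Foundation/Foundation.h>
#import <TDParseKit/TDParseKit.h> // assumed umbrella header name

// Hypothetical helper: returns a tokenizer configured for a shell-like script.
static TDTokenizer *ScriptTokenizerWithString(NSString *s) {
    TDTokenizer *t = [TDTokenizer tokenizerWithString:s];

    // Treat != as one Symbol token instead of two
    [t.symbolState add:@"!="];

    // Let Word tokens start with, and contain, underscores
    [t setTokenizerState:t.wordState from:'_' to:'_'];
    [t.wordState setWordChars:YES from:'_' to:'_'];

    // Silently consume "hash" single-line comments
    [t setTokenizerState:t.commentState from:'#' to:'#'];
    [t.commentState addSingleLineStartSymbol:@"#"];

    return t;
}
```

The returned tokenizer is then read in the usual way, by calling nextToken until the EOF token is returned.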

Parsing

TDParseKit also includes a collection of token parser subclasses (of the abstract TDParser class) including collection parsers such as TDAlternation, TDSequence, and TDRepetition as well as terminal parsers including TDWord, TDNum, TDSymbol, TDQuotedString, etc. Also included are parser subclasses which work on individual chars such as TDChar, TDDigit, and TDSpecificChar. These char parsers are useful for things like RegEx parsing. Generally speaking though, the token parsers will be more useful and interesting.

The parser classes follow the Composite pattern. Programs can build a composite parser in Objective-C (rather than in a separate language, as with lex and yacc) from a collection of terminal parsers composed into alternations, sequences, and repetitions to represent an infinite number of languages.

Parsers built from TDParseKit are non-deterministic, recursive descent parsers, which basically means they trade some performance for ease of user programming and simplicity of implementation.

Here is an example of how one might build a parser for a simple voice-search command language (note: TDParseKit does not include any kind of speech recognition technology). The language consists of:

search google for? <search-term>
...

	[self parseString:@"search google 'iphone'"];
...
	
- (void)parseString:(NSString *)s {
	TDSequence *parser = [TDSequence sequence];

	[parser add:[[TDLiteral literalWithString:@"search"] discard]];
	[parser add:[[TDLiteral literalWithString:@"google"] discard]];

	TDAlternation *optionalFor = [TDAlternation alternation];
	[optionalFor add:[TDEmpty empty]];
	[optionalFor add:[TDLiteral literalWithString:@"for"]];

	[parser add:[optionalFor discard]];

	TDParser *searchTerm = [TDQuotedString quotedString];
	[searchTerm setAssembler:self selector:@selector(workOnSearchTermAssembly:)];
	[parser add:searchTerm];

	TDAssembly *result = [parser bestMatchFor:[TDTokenAssembly assemblyWithString:s]];
	
	NSLog(@" %@", result);

	// output:
	//  ['iphone']search/google/'iphone'^
}

...

- (void)workOnSearchTermAssembly:(TDAssembly *)a {
	TDToken *t = [a pop]; // a QuotedString token with a stringValue of 'iphone'
	[self doGoogleSearchForTerm:t.stringValue];
}
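
Because the parsers are composites, the command language can be extended by swapping one terminal for an alternation. The following is a minimal sketch, assuming a second search engine name ("yahoo") should be accepted in place of "google". It uses only the classes and methods shown in the example above; the function names and the umbrella header import are hypothetical. Here discard is applied to the "for" literal itself, in the same way it is applied to the "search" literal.

```objc
#import <Foundation/Foundation.h>
#import <TDParseKit/TDParseKit.h> // assumed umbrella header name

// Hypothetical helper. Builds: search (google | yahoo) for? <search-term>
static TDParser *SearchCommandParser(void) {
    TDSequence *parser = [TDSequence sequence];
    [parser add:[[TDLiteral literalWithString:@"search"] discard]];

    // Either engine name matches; it is not discarded, so its token
    // remains on the assembly's stack for later inspection
    TDAlternation *engine = [TDAlternation alternation];
    [engine add:[TDLiteral literalWithString:@"google"]];
    [engine add:[TDLiteral literalWithString:@"yahoo"]];
    [parser add:engine];

    // Optional "for": either nothing at all, or the discarded literal
    TDAlternation *optionalFor = [TDAlternation alternation];
    [optionalFor add:[TDEmpty empty]];
    [optionalFor add:[[TDLiteral literalWithString:@"for"] discard]];
    [parser add:optionalFor];

    [parser add:[TDQuotedString quotedString]];
    return parser;
}

// Hypothetical helper: runs the parser over one command string.
static TDAssembly *MatchSearchCommand(NSString *s) {
    TDAssembly *a = [TDTokenAssembly assemblyWithString:s];
    return [SearchCommandParser() bestMatchFor:a];
}
```

With no assembler attached, a successful match should leave the engine token and the quoted search term on the stack, so calling pop twice on the result would yield the search term first and the engine name second.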