git.wincent.com - wikitext.git/commit
Tokenize alphanumeric characters separately from other printables
author    Wincent Colaiuta <win@wincent.com>
Wed, 23 Apr 2008 23:36:54 +0000 (01:36 +0200)
committer Wincent Colaiuta <win@wincent.com>
Wed, 23 Apr 2008 23:36:54 +0000 (01:36 +0200)
commit    51c8e75dd3378b4f95c221d899eaf6adc91079f1
tree      739fac1f6c81e42d55474e402bbedaa78720313c
parent    a759630c58624c0c304f06a6285f06afd2370c6c
Tokenize alphanumeric characters separately from other printables

This is a preliminary step prior to implementing tokenize for the
purposes of full-text search indexing. Basically, we want to split
the input up into three "interesting" types of tokens: URIs, email
addresses, and alphanumeric words; everything else will be
ignored.

In other words, input like:

  <nowiki>foo don't bar, win@wincent.com: http://example.com/

would yield these tokens: foo, don, t, bar, win@wincent.com, and
http://example.com/. Note that the "interesting" words can only ever
be ASCII in this initial approximation.
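
As a rough sketch of the intended behaviour (not the real
implementation, which is a Ragel scanner generated from
ext/wikitext_ragel.rl), a plain-Ruby approximation might look
something like the following; the reference_tokenize name and the
regular expressions are purely illustrative, and markup such as
<nowiki> is not handled here:

  # Illustrative approximation only: the real scanner is generated by
  # Ragel and also recognizes markup tokens (such as <nowiki>), which
  # this sketch does not attempt to do.
  def reference_tokenize(input)
    input.scan(%r{
      [a-z]+://\S+        | # URI (crude approximation)
      [^\s@]+@[^\s@,:]+   | # email address (crude approximation)
      [a-zA-Z0-9]+          # alphanumeric word (ASCII only)
    }x)
  end

  reference_tokenize("foo don't bar, win@wincent.com: http://example.com/")
  # => ["foo", "don", "t", "bar", "win@wincent.com", "http://example.com/"]

In the extension itself the classification lives in the Ragel grammar
in ext/wikitext_ragel.rl rather than in regular expressions.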

Signed-off-by: Wincent Colaiuta <win@wincent.com>
ext/parser.c
ext/token.c
ext/token.h
ext/wikitext_ragel.c
ext/wikitext_ragel.rl
spec/tokenizing_spec.rb