Wincent Colaiuta [Thu, 6 Aug 2009 12:11:21 +0000 (14:11 +0200)]
Silence compiler warning in StringValue
Our use of "StringValue(j)", where "j" is of type "long" and the macro
wants a value of type "VALUE" ("unsigned long"), produces a warning that
"pointer targets in passing argument 1 of ‘rb_string_value’ differ in
signedness".
Merely casting ("StringValue((VALUE)j)") doesn't work because
"StringValue" itself is a macro and produces this warning: "argument to
'&' not really an lvalue; this will be a hard error in the future".
So use a temporary variable to suppress the warning without having to
use a cast inside the macro.
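
The lvalue problem can be reproduced outside of Ruby. The following is a minimal sketch in which "STRING_VALUE" and "string_value" are hypothetical stand-ins for "StringValue" and "rb_string_value"; like the real macro, the stand-in takes the address of its argument:

```c
typedef unsigned long VALUE;    /* stand-in for Ruby's VALUE type */

/* Hypothetical stand-in for rb_string_value(): takes a pointer to a VALUE. */
static VALUE string_value(volatile VALUE *ptr)
{
    return *ptr;
}

/* The macro takes the address of its argument, so the argument must be an
 * lvalue of type VALUE. */
#define STRING_VALUE(v) string_value(&(v))

VALUE fetch(long j)
{
    /* STRING_VALUE(j) would pass a "long *" where a "VALUE *" is expected,
     * and STRING_VALUE((VALUE)j) would take the address of a cast
     * expression, which is not an lvalue. A temporary avoids both. */
    VALUE tmp = (VALUE)j;
    return STRING_VALUE(tmp);
}
```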
Wincent Colaiuta [Wed, 27 May 2009 15:49:24 +0000 (17:49 +0200)]
Use fixed-width markup in release notes for class names etc
Make things more consistent with the rest of the documentation by using
fixed-width font markup for CSS class names and HTML tags in the release
notes (rather than using double quotes).
Wincent Colaiuta [Tue, 26 May 2009 20:28:18 +0000 (22:28 +0200)]
Update README for new PRE_START "lang" attribute
Document the use of the "lang" attribute in PRE_START tags. Note that
contrary to what is said in the commit message for 024870e, uppercase
letters are in fact allowed in the attribute value.
Wincent Colaiuta [Tue, 26 May 2009 20:03:31 +0000 (22:03 +0200)]
Handle optional "lang" attribute in PRE_START tags
In addition to the literal "<pre>" tag syntax accept a special syntax
which includes a "lang" attribute. This is used to mark up a PRE_START
for syntax highlighting as a particular language; for example:
Note that language names must consist only of lower-case letters, and
"-syntax" is automatically appended, to prevent users from inserting
totally arbitrary CSS class names into the translated output. This means
that names like "ruby" and "c" will work, but that names like "obj-c" or
"ObjC" would have to be written as "objc".
In this implementation the required format for the PRE_START tag is very
strict; no excess whitespace is allowed within the tag. This has the
benefit that it is simple and fast, and doesn't require re-scanning the
token to see exactly where the value of the "lang" attribute begins and
ends.
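
As a rough illustration of the naming rule as this message states it (lower-case letters only, with "-syntax" appended), here is a sketch of a validator. The function name "lang_to_css_class" is hypothetical and is not taken from the extension, which does this work in the scanner:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: if lang is non-empty and consists only of the
 * letters 'a'..'z', writes "<lang>-syntax" into out and returns true.
 * Returns false for any other name, or if the result does not fit. */
bool lang_to_css_class(const char *lang, char *out, size_t out_size)
{
    if (lang[0] == '\0')
        return false;
    for (const char *p = lang; *p != '\0'; p++) {
        if (*p < 'a' || *p > 'z')
            return false;   /* rejects "obj-c", "ObjC", "c99" and so on */
    }
    int n = snprintf(out, out_size, "%s-syntax", lang);
    return n >= 0 && (size_t)n < out_size;
}
```

Because the suffix is always appended and the alphabet is restricted, the caller cannot produce an arbitrary class name in the output.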
Wincent Colaiuta [Thu, 21 May 2009 19:13:34 +0000 (21:13 +0200)]
Really change license header
90de8af corrected the license header in the generated C file rather than
the Ragel source (I need to add an "rl" filetype to Ack so that I won't
overlook this kind of thing in the future).
Wincent Colaiuta [Wed, 13 May 2009 21:09:44 +0000 (23:09 +0200)]
Swap README.rdoc and doc/README (symlink <--> file)
Looks like a top-level symlink won't get picked up by GitHub,
so swap things around: make the top-level item be the real
file and the item in the "doc" subdirectory can be a symlink
(RDoc will continue to generate the HTML for the README as
before).
Wincent Colaiuta [Wed, 13 May 2009 16:08:40 +0000 (18:08 +0200)]
Add README symlink at top level for GitHub mirror
GitHub will automatically display any README found at the top
level of the tree, so set up a symlink (we'll see if it works)
from doc/README to README.rdoc.
Wincent Colaiuta [Wed, 13 May 2009 09:04:16 +0000 (11:04 +0200)]
Add ary_includes2, ary_includes3 functions
A fairly common pattern in the codebase is successively calling
the ary_includes function 1, 2 or 3 times. In the worst case
scenario this incurs three function calls and three array traversals
via a for-loop in the function.
Add 2 and 3-argument variants of the function so that we can
replace these instances with a single function call and a single
array traversal.
As shown in the benchmarking notes, this bumps the ary_includes
function from 4th place in the profile (where it was the "low-hanging
fruit") and right off the list.
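
A sketch of the idea follows. The "ary_t" layout shown here is an assumption made for illustration, as is the "any of the values is present" semantics of the 3-argument variant:

```c
#include <stdbool.h>

/* Assumed layout; the real ary_t in ary.h may differ. */
typedef struct {
    int count;
    int *entries;
} ary_t;

/* One traversal per call. */
bool ary_includes(const ary_t *ary, int val)
{
    for (int i = 0; i < ary->count; i++) {
        if (ary->entries[i] == val)
            return true;
    }
    return false;
}

/* Equivalent to ary_includes(a) || ary_includes(b) || ary_includes(c),
 * but with a single function call and a single traversal. */
bool ary_includes3(const ary_t *ary, int a, int b, int c)
{
    for (int i = 0; i < ary->count; i++) {
        int entry = ary->entries[i];
        if (entry == a || entry == b || entry == c)
            return true;
    }
    return false;
}
```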
Wincent Colaiuta [Tue, 12 May 2009 22:44:49 +0000 (00:44 +0200)]
Bump version number for 1.7 release
Normally "minor" version bumps (eg 1.6 to 1.7) are
reserved for versions which introduce new features
and "tiny" version bumps (eg 1.6 to 1.6.1) are for
maintenance (bugfix) releases.
This release doesn't contain any new features but
I am nevertheless bumping the "minor" version
number because the release does involve a fairly
major rewrite of the internals, and as such is
less conservative than a normal "maintenance"
update.
Wincent Colaiuta [Tue, 12 May 2009 22:29:20 +0000 (00:29 +0200)]
Change internal function prefix from "_Wikitext_" to "wiki_"
Follow the pattern already established in str.c/str.h and
ary.c/ary.h, and use a short lowercase prefix with no
leading underscore for "internal" functions (that is,
functions that are for use within the C part of the extension
and are not exposed as Ruby methods).
This buys us some horizontal space and makes keeping things
in a reasonable number of columns a bit easier.
Externally exported methods still follow the old pattern of
Module_class_method (eg. Wikitext_parser_parse).
Wincent Colaiuta [Tue, 12 May 2009 21:32:00 +0000 (23:32 +0200)]
Remove comment from _Wikitext_append_entity_from_utf32_char
There is no point in special casing entities like "quot", "amp"
and such in this function because neither of the two call sites
would ever pass in such a code point:
- the _Wikitext_append_sanitized_link_target already explicitly
handles all these cases, either emitting the entities manually
or raising an exception.
- the DEFAULT case won't ever have to process those characters
because they would have already been tokenized otherwise and
handled in the QUOT, AMP, LESS and GREATER cases.
Wincent Colaiuta [Tue, 12 May 2009 21:23:27 +0000 (23:23 +0200)]
Remove Wikitext_parser_encode_special_link_target function
"Special" link targets haven't been treated specially since prior
to version 1.4.0, so the Wikitext_parser_encode_special_link_target
function is literally identical to the Wikitext_parser_encode_link_target
function.
Seeing as it was only ever exposed for testing purposes and never
advertised or documented as public API, get rid of it.
Wincent Colaiuta [Tue, 12 May 2009 20:48:52 +0000 (22:48 +0200)]
Abandon plan to eliminate ALLOC_N in _Wikitext_encode_link_target
It's not worth trying to eliminate this ALLOC_N because we are
writing the encoded output back to the input buffer. If we have
to use any percent escapes then we would need to do a potentially
expensive memmove operation each time we insert an escape.
(Although note, in the case where we eat leading whitespace we would
not need a memmove for that because we are effectively doing a
manual move anyway).
We could potentially add a special case which overwrote in-place
only if there were no percent escapes (ie. which only did the ALLOC_N
on seeing a percent escape, and only if there were not enough leading
whitespace to compensate for the extra characters required by the
escape) but I am not sure if it will be worth the extra complexity.
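
To illustrate why a separate output buffer is simpler: each escaped byte grows from one byte to three, so an in-place encoder would have to shift the remaining input on every escape. The sketch below writes into a freshly allocated buffer instead. It uses malloc rather than ALLOC_N, and the set of characters left unescaped (the RFC 3986 "unreserved" set) is an assumption, not necessarily what the extension does:

```c
#include <ctype.h>
#include <stdlib.h>

/* Returns a newly allocated, NUL-terminated, percent-encoded copy of the
 * first len bytes of in, or NULL if allocation fails. Caller frees. */
char *encode_link_target(const char *in, size_t len)
{
    static const char hex[] = "0123456789abcdef";
    char *out = malloc(len * 3 + 1);    /* worst case: every byte escaped */
    if (out == NULL)
        return NULL;
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out[j++] = (char)c;         /* copied through unchanged */
        } else {
            out[j++] = '%';             /* one input byte, three output bytes */
            out[j++] = hex[c >> 4];
            out[j++] = hex[c & 0x0f];
        }
    }
    out[j] = '\0';
    return out;
}
```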
Wincent Colaiuta [Tue, 12 May 2009 20:36:18 +0000 (22:36 +0200)]
Remove unnecessary parens
C operator precedence makes most of these parens unnecessary
so remove them.
We don't remove the parens around the "&&" expressions even
though they aren't necessary, because without them GCC will
warn "suggest parentheses around && within ||". Better to
have a warning-free build than shave off a few bytes at any
cost.
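
A small illustration of the distinction, using a hypothetical function that is not taken from the codebase:

```c
#include <stdbool.h>

bool needs_escape(int c, bool in_link, bool strict)
{
    /* Before: return ((c == '<') || (c == '>') || (in_link && strict));
     * "==" binds tighter than "||", so those parens can go. "&&" also binds
     * tighter than "||", but its parens stay so that GCC's -Wparentheses
     * check does not warn. */
    return c == '<' || c == '>' || (in_link && strict);
}
```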
Wincent Colaiuta [Tue, 12 May 2009 20:22:27 +0000 (22:22 +0200)]
Rename input var in _Wikitext_encode_link_target
This is the first in a series of changes to the
_Wikitext_encode_link_target function to make it
more consistent with the other, similar
_Wikitext_append_sanitized_link_target function.
Wincent Colaiuta [Tue, 12 May 2009 10:00:41 +0000 (12:00 +0200)]
Conditional compilation for Ruby 1.8.x and 1.9.x
Detect the Ruby version in extconf.rb using the RUBY_VERSION
constant and pass this along as a C preprocessor macro (either
RUBY_1_8_x or RUBY_1_9_x).
By definition we're doing this conditional compilation because
we want to manipulate RString struct members directly, so we
tie the conditional compilation very tightly to the version
number. Older or newer versions will bail with an error.
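
On the C side, the pattern might look something like the sketch below. The macro names come from this message; "TARGET_RUBY" and "target_ruby" are hypothetical, and the fallback define exists only so that the sketch compiles on its own (in the real build the macro would arrive as a -D flag from extconf.rb):

```c
/* Sketch only: pretend the build passed -DRUBY_1_9_x if neither is set. */
#if !defined(RUBY_1_8_x) && !defined(RUBY_1_9_x)
#define RUBY_1_9_x
#endif

#if defined(RUBY_1_8_x)
#define TARGET_RUBY "1.8.x"     /* code touching 1.8 RString members here */
#elif defined(RUBY_1_9_x)
#define TARGET_RUBY "1.9.x"     /* code touching 1.9 RString members here */
#else
#error "unsupported Ruby version"
#endif

const char *target_ruby(void)
{
    return TARGET_RUBY;
}
```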
Wincent Colaiuta [Tue, 12 May 2009 09:48:41 +0000 (11:48 +0200)]
Preliminary fix for Ruby 1.9 compatibility
The migration to the str_t type for the output buffer broke
compatibility with Ruby 1.9.
This commit shows what things would look like if we could
do a compile-time check for 1.9 and could thus modify the code
to make it compatible.
Unfortunately this doesn't actually work because although Ruby
comes with a "version.h" file with all the necessary macros,
for some reason it isn't installed when doing "make install".
In my local testing, that means that I can build under 1.9,
but it will fall back to the system-installed (1.8) headers
and the values for RUBY_VERSION_* are not correct.
Wincent Colaiuta [Mon, 11 May 2009 23:13:26 +0000 (01:13 +0200)]
Remove unnecessary "len" variables
Seeing as these expressions are short and are only evaluated once
there is no real need to store them in temporary variables; including
the expression in the relevant line is fine for readability due to
their short length.
This makes the variable initialization order of the
_Wikitext_append_sanitized_link_target function match
that of the _Wikitext_trim_link_text function.
Wincent Colaiuta [Mon, 11 May 2009 22:59:09 +0000 (00:59 +0200)]
Drop INVALID_ENCODING macro
Now that there is no conditional logic in the macro there is not
much justification for its existence. Use the literal call instead
seeing as it is so simple.
Wincent Colaiuta [Mon, 11 May 2009 21:30:01 +0000 (23:30 +0200)]
Rename _Wikitext_parser_* functions to _Wikitext_*
While it makes sense to name functions that are externally
exported to the "Ruby" side following the "Module_Class_*"
pattern, it's not really justified or necessary for functions
which are for internal use only.
This is especially the case for some of these functions whose
internal purpose and use have drifted away from the external
counterparts. (For example, the _Wikitext_append_sanitized_link_target
function which no longer mirrors the externally visible
Wikitext_parser_sanitize_link_target function.)
Wincent Colaiuta [Mon, 11 May 2009 19:11:12 +0000 (21:11 +0200)]
Reformat _Wikitext_utf8_to_utf32 for better readability
Reduce line lengths to make the _Wikitext_utf8_to_utf32
function more readable, most notably by splitting lengthy
condition expressions and bitwise-OR expressions across
multiple lines.
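
For readers unfamiliar with this kind of function, here is an independent sketch of a UTF-8 decoder laid out in that style, with one condition or one bitwise-OR operand per line. It is not the extension's implementation; the signature and the error convention (returning 0) are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes the UTF-8 sequence starting at src (src < end is required).
 * On success stores the code point in *dest and returns the number of bytes
 * consumed. On malformed, truncated or overlong input returns 0 and leaves
 * *dest unchanged. */
size_t utf8_to_utf32(const unsigned char *src, const unsigned char *end,
                     uint32_t *dest)
{
    size_t avail = (size_t)(end - src);
    uint32_t cp;

    if (src[0] < 0x80) {                    /* 0xxxxxxx */
        *dest = src[0];
        return 1;
    } else if ((src[0] & 0xe0) == 0xc0) {   /* 110xxxxx 10xxxxxx */
        if (avail < 2 ||
            (src[1] & 0xc0) != 0x80)
            return 0;
        cp =
            ((uint32_t)(src[0] & 0x1f) << 6) |
            (uint32_t)(src[1] & 0x3f);
        if (cp < 0x80)
            return 0;                       /* overlong encoding */
        *dest = cp;
        return 2;
    } else if ((src[0] & 0xf0) == 0xe0) {   /* 1110xxxx + 2 continuation */
        if (avail < 3 ||
            (src[1] & 0xc0) != 0x80 ||
            (src[2] & 0xc0) != 0x80)
            return 0;
        cp =
            ((uint32_t)(src[0] & 0x0f) << 12) |
            ((uint32_t)(src[1] & 0x3f) << 6) |
            (uint32_t)(src[2] & 0x3f);
        if (cp < 0x800 ||
            (cp >= 0xd800 && cp <= 0xdfff))
            return 0;                       /* overlong or surrogate */
        *dest = cp;
        return 3;
    } else if ((src[0] & 0xf8) == 0xf0) {   /* 11110xxx + 3 continuation */
        if (avail < 4 ||
            (src[1] & 0xc0) != 0x80 ||
            (src[2] & 0xc0) != 0x80 ||
            (src[3] & 0xc0) != 0x80)
            return 0;
        cp =
            ((uint32_t)(src[0] & 0x07) << 18) |
            ((uint32_t)(src[1] & 0x3f) << 12) |
            ((uint32_t)(src[2] & 0x3f) << 6) |
            (uint32_t)(src[3] & 0x3f);
        if (cp < 0x10000 ||
            cp > 0x10ffff)
            return 0;                       /* overlong or out of range */
        *dest = cp;
        return 4;
    }
    return 0;   /* stray continuation byte or invalid lead byte */
}
```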
Wincent Colaiuta [Mon, 11 May 2009 19:00:20 +0000 (21:00 +0200)]
Avoid copying string backing when returning from parse function
This is a somewhat nasty hack to avoid making a copy of the output
when it comes time to return from the function. For the time being
it will only work with Ruby 1.8.x, or at least, 1.9.x hasn't been
tested yet.
Wincent Colaiuta [Mon, 11 May 2009 18:56:42 +0000 (20:56 +0200)]
Make _Wikitext_parser_sanitize_link_target return void
Avoid the creation of another temporary Ruby String instance
by appending directly to a buffer. As part of this change the
_Wikitext_parser_sanitize_link_target function has been renamed
to _Wikitext_parser_append_sanitized_link_target.
Wincent Colaiuta [Mon, 11 May 2009 17:23:02 +0000 (19:23 +0200)]
Refactor _Wikitext_utf32_char_to_entity (append to buffer)
Rename the _Wikitext_utf32_char_to_entity function to
_Wikitext_append_entity_from_utf32_char, teaching it to
append to a target buffer directly rather than creating
a temporary Ruby String instance.
I don't particularly like these low-level manipulations but
the main goal here is to avoid the extra allocation; a
subsequent commit will clean up.
Wincent Colaiuta [Sun, 10 May 2009 23:44:11 +0000 (01:44 +0200)]
Add sanity checks to parsing benchmark scripts
After the grand refactoring there are evidently still some lingering
low-level errors, because the benchmarking scripts are bailing with
an "overlong encoding" error after a certain period of time (full
output below).
I've added some sanity checks to the scripts to try and catch discrepancies
but so far none have been discovered.
Here is the full output of the run (this one for "parsing.rb", but the
results are similar for "profile_parsing.rb"):
Rehearsal -------------------------------------------------------------
short slab of ASCII text    1.800000   0.020000   1.820000 (  2.182344)
short slab of UTF-8 text    3.540000   0.030000   3.570000 (  4.127638)
longer slab of ASCII text  14.600000   0.140000  14.740000 ( 17.301072)
longer slab of UTF-8 text  46.150000   0.490000  46.640000 ( 58.118039)
--------------------------------------------------- total: 66.770000sec
                                user     system      total        real
short slab of ASCII text    1.800000   0.020000   1.820000 (  2.087143)
short slab of UTF-8 text    3.580000   0.040000   3.620000 (  4.315676)
longer slab of ASCII text  14.680000   0.160000  14.840000 ( 18.018380)
longer slab of UTF-8 text benchmarks/parsing.rb:321:in `parse': invalid
encoding: overlong encoding (Wikitext::Parser::Error)
from benchmarks/parsing.rb:321:in `parse'
from benchmarks/parsing.rb:320:in `times'
from benchmarks/parsing.rb:320:in `parse'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/...
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/...
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/...
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/...
from benchmarks/parsing.rb:331
Wincent Colaiuta [Sun, 10 May 2009 16:19:12 +0000 (18:19 +0200)]
Overallocate for speed in str.c
One of the key motivations for switching to the str_t type internally
is that we can avoid allocations by re-using the same storage over and
over during the transformation.
We can avoid other allocations by overallocating when more storage
is requested, seeing as almost all requests for more storage will
later be followed by other requests.
At the moment, the original implementation is quite fast:
                                user     system      total        real
short slab of ASCII text    1.440000   0.000000   1.440000 (  1.445547)
short slab of UTF-8 text    2.900000   0.010000   2.910000 (  2.927274)
longer slab of ASCII text  12.710000   0.040000  12.750000 ( 12.816209)
longer slab of UTF-8 text  35.210000   0.080000  35.290000 ( 35.661577)
The new implementation is actually slower, because we have had to add
some wasteful conversions back-and-forth between VALUE/String and str_t:
short slab of ASCII text    1.550000   0.000000   1.550000 (  1.556956)
short slab of UTF-8 text    3.340000   0.010000   3.350000 (  3.377874)
longer slab of ASCII text  15.410000   0.030000  15.440000 ( 15.484308)
longer slab of UTF-8 text  45.230000   0.130000  45.360000 ( 45.631355)
It is expected that the performance loss will be recovered once these
wasteful conversions are eliminated.
But before going that far, adding overallocation brings very large
improvements, enough to compensate for the inefficient conversions:
short slab of ASCII text    1.190000   0.010000   1.200000 (  1.233443)
short slab of UTF-8 text    2.460000   0.020000   2.480000 (  2.536714)
longer slab of ASCII text  11.000000   0.050000  11.050000 ( 11.208843)
longer slab of UTF-8 text  33.490000   0.130000  33.620000 ( 34.190941)
Those numbers use an overallocation constant of 256 bytes; will later
experiment with other constants to find the optimal overallocation.
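
The overallocation strategy can be sketched as follows. The 256-byte constant is the one quoted above; the "str_t" layout, the function names and the use of realloc are assumptions made for illustration and may not match str.c:

```c
#include <stdlib.h>
#include <string.h>

#define OVERALLOC 256   /* extra bytes requested on every growth */

/* Assumed layout; the real str_t in str.h may differ. */
typedef struct {
    char *ptr;
    size_t len;         /* bytes in use */
    size_t capacity;    /* bytes allocated */
} str_t;

/* Ensures room for at least len bytes. When growth is needed, asks for
 * OVERALLOC bytes more than requested so that the next few appends can be
 * satisfied without another allocation. Returns 0 on success, -1 on failure. */
int str_grow(str_t *str, size_t len)
{
    if (len <= str->capacity)
        return 0;
    size_t new_capacity = len + OVERALLOC;
    char *p = realloc(str->ptr, new_capacity);
    if (p == NULL)
        return -1;
    str->ptr = p;
    str->capacity = new_capacity;
    return 0;
}

/* Appends len bytes from src. Returns 0 on success, -1 on failure. */
int str_append(str_t *str, const char *src, size_t len)
{
    if (str_grow(str, str->len + len) != 0)
        return -1;
    memcpy(str->ptr + str->len, src, len);
    str->len += len;
    return 0;
}
```

With a zero-initialized str_t, the first append allocates once and subsequent small appends reuse the spare capacity.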
Wincent Colaiuta [Fri, 8 May 2009 15:44:01 +0000 (17:44 +0200)]
Change 4 VALUE (String) members of the parser_t struct to str_t type
This is unfortunately quite a large commit because the nature of
the change requires many parts to be modified at once; the
intermediate stages are not buildable and therefore not
bisectable.
Change the capture, output, link_target and link_text members of
the parser struct from VALUE (String) type to str_t. This should
improve performance because the str_t is faster and designed for
easy reuse so we can allocate a few instances at the beginning
of parsing and then use them repeatedly throughout the parse,
thus avoiding many time-consuming allocations.
Remove the "capturing" member and instead use the "capture"
pointer as an indication of whether capturing is in progress.
Change the type of the "target" param in the
_Wikitext_pop_from_stack function (and the other "pop
from stack" functions) from VALUE (String) to pointer
to str_t.
Change the type of the "check_autolink" parameter to the
_Wikitext_append_hyperlink function from VALUE (boolean) to
bool.
Remove redundant passing in of parser->output to the
_Wikitext_pop_from_stack function.
Teach _Wikitext_blank to accept a pointer to a str_t struct
rather than a Ruby String (VALUE).
Add parser_new function to encapsulate the initial allocation
and initialization of the parser_t struct.
Rename str_append_rb_str function to str_append_string for
consistency with other functions in str.c.
Wincent Colaiuta [Fri, 8 May 2009 14:49:04 +0000 (16:49 +0200)]
Make parser struct participate in Ruby's Garbage Collection
Instead of having individual str_t and ary_t members participate
in Ruby's mark-and-sweep Garbage Collection, put the parser
struct on the stack and make the parser participate; it will
be responsible for cleaning up its own member resources when
it falls out of scope.
Wincent Colaiuta [Fri, 8 May 2009 14:08:03 +0000 (16:08 +0200)]
Add "capturing" member to parser struct
This is preparation for the eventual move of some, perhaps all,
of the members which are currently of String (VALUE) type to the
faster, more easily reused str_t type.
With the VALUE type we can check whether a member is initialized
or in use by doing a NIL_P(member) test.
This is not possible with the str_t type as it is a struct rather
than a pointer (although admittedly, we will be using a pointer to
the struct rather than the struct itself).
We don't want to dispose of the struct and set the pointer to NULL
because the whole point of reusing the str_t structs is that we
can allocate them only once at the start of parsing and then
use them over and over.
Likewise we don't want to abuse the "len" member of the str_t
struct (for example, setting it to -1 to flag that it is not
in use), because it is not exactly intuitive or self-evident.
Similarly, we don't want to add an additional struct member (a
boolean) to indicate whether the struct is in use or not. The
struct itself shouldn't have to know or care about this; this
should be the responsibility of the caller using the struct.
So for now we set up this "capturing" bool so that we can
track when the parser is in capturing mode. The intention is
that in a later commit the "capture" member will become a
str_t instance (or a pointer to one).
Wincent Colaiuta [Fri, 8 May 2009 13:39:27 +0000 (15:39 +0200)]
Use C99 _Bool type
Seeing as we already compile in C99 mode, we may as well make use
of the _Bool type defined by that standard. We also include the
system "stdbool.h" header so as to have access to the "bool", "true"
and "false" convenience macros.
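
One practical difference from an int-based flag is that conversion to _Bool always yields 0 or 1. A short sketch, with a hypothetical function name:

```c
#include <stdbool.h>

/* Returns true if any bit of mask is set in flags. The result of the
 * bitwise AND may be any non-zero value (0x100, say); converting it to
 * bool on return normalizes it to exactly 1. */
bool has_flag(unsigned int flags, unsigned int mask)
{
    return flags & mask;
}
```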
Wincent Colaiuta [Fri, 8 May 2009 12:50:31 +0000 (14:50 +0200)]
Improve efficiency of _Wikitext_pop_all_from_stack
Use a for-loop instead of repeatedly calling ary_entry inside
a while-loop. The simple integer comparison will be faster
than the function call. (And in any case, the
_Wikitext_pop_from_stack function which is called here will
do an ary_entry call anyway; so what's really happening here
with this change is that we call ary_entry once for each item
instead of twice.)
Wincent Colaiuta [Fri, 8 May 2009 12:10:54 +0000 (14:10 +0200)]
Reuse link_target if link_text is Qnil in _Wikitext_append_hyperlink
This cleans up a few call sites of _Wikitext_append_hyperlink. The
majority of these sites pass in the same text for the link target
and link text parameters, so teach the function to automatically
reuse the link target as the link text if no link text is explicitly
provided.
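
The fallback might be sketched like this, using plain C strings and NULL in place of Ruby Strings and Qnil. The function name, signature and output format are hypothetical, and no escaping is performed:

```c
#include <stddef.h>
#include <stdio.h>

/* Writes an anchor tag into out. If link_text is NULL the link target is
 * reused as the visible text. Returns the snprintf result. */
int append_hyperlink(char *out, size_t out_size,
                     const char *link_target, const char *link_text)
{
    if (link_text == NULL)
        link_text = link_target;    /* caller did not supply separate text */
    return snprintf(out, out_size, "<a href=\"%s\">%s</a>",
                    link_target, link_text);
}
```

Call sites that previously passed the same value twice can then pass NULL for the second argument.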
Wincent Colaiuta [Fri, 8 May 2009 12:00:31 +0000 (14:00 +0200)]
Minor clean-up in _Wikitext_rollback_failed_external_link
Minor reorganization to make _Wikitext_rollback_failed_external_link a
little cleaner. Avoid the almost identical calls to
_Wikitext_append_hyperlink and instead set up a link_class local
variable.
Wincent Colaiuta [Fri, 8 May 2009 11:31:01 +0000 (13:31 +0200)]
Refactor _Wikitext_rollback_failed_link function and friends
The _Wikitext_rollback_failed_link function now encapsulates the
common pattern of trying to roll back failed internal and external
links in a single function call.
On those occasions when we want to roll back only one type of link
we must instead use the _Wikitext_rollback_failed_internal_link or
_Wikitext_rollback_failed_external_link functions.
Wincent Colaiuta [Thu, 7 May 2009 23:03:44 +0000 (01:03 +0200)]
Teach _Wikitext_append_hyperlink to check the autolink setting
Rather than checking the autolink setting in the numerous sites
where _Wikitext_append_hyperlink is called, move the check into
the function itself, and pass a flag in specifying whether to
perform the check.
The overall saving here is about 8 lines thanks to the eliminated
repetition.
Wincent Colaiuta [Thu, 7 May 2009 22:52:05 +0000 (00:52 +0200)]
Remove temporary string variable from _Wikitext_hyperlink
Now that _Wikitext_hyperlink returns void there is no longer any
need to use a temporary String instance. Instead, we append
directly to the parser->output buffer, thus saving an allocation.
Wincent Colaiuta [Thu, 7 May 2009 22:30:54 +0000 (00:30 +0200)]
Add comment justifying the scope_includes_space variable
This comment serves as a reminder for why this variable exists
(to remember what was on the stack prior to popping); without
it the reader might ask "why do we have a temporary variable
here which is only used once?".
Wincent Colaiuta [Thu, 7 May 2009 15:57:32 +0000 (17:57 +0200)]
Remove stale comment
This comment is a left-over from the distant past when many of
these functions were explicitly marked as inline functions,
rather than letting the compiler decide when to inline.
Wincent Colaiuta [Thu, 7 May 2009 09:35:57 +0000 (11:35 +0200)]
Use absolute paths in internal "requires"
Ensure that when locally testing or otherwise using a specific
version of the extension that the files included using "require"
come from the same version and not from some other version in the
load path.
For example, prior to this commit, running:
irb -r ext/wikitext lib/wikitext/string
would not produce the desired result. First the local copy of the
extension would be loaded, then the local "lib/wikitext/string",
but then the latter would do a "require 'wikitext/parser'", which
would load the first corresponding file in the load path (usually
the latest installed gem), which would in turn do a "require
'wikitext'" and end up loading the first corresponding file in
the load path.
Wincent Colaiuta [Wed, 6 May 2009 23:03:58 +0000 (01:03 +0200)]
Specify ":indent => false" default in wikitext/string extension
Seeing as the String extension is primarily for use in Rails
applications, where setting up Haml to run with "ugly" mode turned
on is a good idea, it makes sense to make the "w" and "to_wikitext"
methods on the String class pass in ":indent => false" by default.
This can be overridden if desired by passing in an explicit indent
such as ":indent => 0".