Tuesday, April 17, 2012

A Data::Dumper bug - or Latin1 strikes again

I did not believe my boss when he said he had found a bug in Data::Dumper - but run this:

To get a character whose internal representation is Latin1 I could also use HTML::Entities::decode( '&oacute;' ) there, with the same result. The output I get on 5.14.2 and 5.10.1 is:

When I check the dumped string, it has the right character encoded in Latin1 - and apparently eval expects UTF8 when use utf8 is set. Without use utf8, eval works OK on it. If the internal representation of the initial character is UTF8 (as when the first line is my $initial = 'ó';), then the dumped string contains UTF8, which in turn might be interpreted incorrectly if the code does not have the use utf8 preamble.
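The round trip described above can be sketched roughly like this. It is a reconstruction under assumptions, not the original test script: the helper name round_trip and the use of $Data::Dumper::Terse (to avoid declaring $VAR1 under strict) are my own choices, and what exactly gets printed depends on the perl version.

```perl
use strict;
use warnings;
use utf8;            # the pragma that affects how the string eval reads its input
use Data::Dumper;

# Hypothetical helper: dump a scalar, then eval the dump back.
sub round_trip {
    my ($value) = @_;
    local $Data::Dumper::Terse = 1;   # emit just the value, without '$VAR1 = '
    my $dumped = Dumper($value);
    my $copy   = eval $dumped;        # may fail or yield a different string
    return ($dumped, $copy, $@);
}

my $initial = "\x{f3}";               # one character, kept internally as one Latin1 byte
my ($dumped, $copy, $error) = round_trip($initial);

my $same = defined($copy) && $copy eq $initial;
printf "initial stored as UTF8 internally: %s\n", utf8::is_utf8($initial) ? 'yes' : 'no';
printf "round trip: %s\n", $same ? 'identical' : 'DIFFERENT';
print  "eval error: $error" if $error;
```

Removing the use utf8 line is the quickest way to compare the two cases.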

Considering that Data::Dumper is a core module, one of the most commonly used ones, and that its docs say:

The return value can be evaled to get back an identical copy of the original reference structure.
this looks like a serious bug.

Is that a known problem? Should I post it to the Perl RT?

Update: Removed the initial eval - "\x{f3}" is enough to get the Latin1 encoded character. Some editing.
Update: I tested it also on 5.15.9 and it fails in the same way.
Update: I've reported it to the Perl RT - I am not sure about the severity chosen and the subject - this was my first Perl bug report.
Update: In reply to the ticket linked above Father Chrysostomos explains: "The real bug here is that ‘eval’ is respecting the ‘use utf8’ from outside it." and later adds that 'use v5.16' will fix the problem in 5.16.
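
A minimal sketch of that fix, assuming perl 5.16 or later. I enable only the unicode_eval feature here, which is part of the bundle that 'use v5.16' turns on; with it the string eval should no longer pick up the use utf8 from the surrounding code. The Terse setting is just my convenience to avoid declaring $VAR1.

```perl
use strict;
use warnings;
use utf8;
use feature 'unicode_eval';   # included in the 'use v5.16' feature bundle
use Data::Dumper;

local $Data::Dumper::Terse = 1;       # emit just the value, without '$VAR1 = '

my $initial = "\x{f3}";               # Latin1 internal representation
my $copy    = eval Dumper($initial);  # eval treats the dump as a string of characters

my $ok = defined($copy) && $copy eq $initial;
print $ok ? "identical\n" : "DIFFERENT\n";
```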


Sebastian said...

Note that \x.. (no {} and only two hexadecimal digits), \x{...}, and chr(...) for arguments less than 0x100 (decimal 256) generate an eight-bit character for backward compatibility with older Perls. For arguments of 0x100 or more, Unicode characters are always produced. If you want to force the production of Unicode characters regardless of the numeric value, use pack("U", ...) instead of \x.., \x{...}, or chr(). [From perluniintro]
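
A small sketch of what that passage describes, using utf8::is_utf8 to peek at the internal representation. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $narrow = "\x{f3}";          # below 0x100: eight-bit internal form
my $wide   = "\x{100}";         # 0x100 or more: always stored as UTF8
my $forced = pack("U", 0xf3);   # same character as $narrow, but stored as UTF8

for my $pair ([narrow => $narrow], [wide => $wide], [forced => $forced]) {
    my ($name, $str) = @$pair;
    printf "%-6s length=%d utf8-flag=%d\n",
        $name, length($str), utf8::is_utf8($str) ? 1 : 0;
}

# The two forms of the same character still compare equal as strings.
print "narrow eq forced\n" if $narrow eq $forced;
```

Perl treats $narrow and $forced as the same one-character string. Only the storage differs, and that difference is what Data::Dumper ends up exposing.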

Replacing the first eval with

 use charnames ':full';

 my $initial = eval '"\N{LATIN SMALL LETTER O WITH ACUTE}"';

fixes your problem.

Sebastian Willert

zby said...

Thanks for the explanation! That indeed explains how the eval works - but I don't agree that it fixes the problem, because the problem is that Dumper does not work correctly on characters that have an internal Latin1 representation, not where you get the character from. You can get it from the eval as above, or from HTML::Entities, or possibly from many other sources (and note that Data::Dumper itself dumps 'ó' as \x{f3} - so it only takes two Dumper/eval iterations to get to the problem). The internal representation should not matter.

Sebastian said...

More on this can be found in perlunifaq. You can also "use encoding::warnings;" to see what's going on behind the scenes.