Sunday, August 28, 2011

is_utf8 is useless - can we have is_character?

Consider this code:

$data_structure = utf8::is_utf8($json)
? from_json($json)
: decode_json($json);

It is taken, together with the is_character suggestion, from an otherwise very informative post: Quick note on using module JSON. I have seen similar code in many places. The idea is to check whether the string you have is character data or a string of bytes, and to treat it accordingly. Unfortunately, is_utf8 does not perform that check:

use strict;
use warnings;

use utf8;
use HTML::Entities;
use JSON;

my $a = HTML::Entities::decode( '&nbsp;' );
my $json = qq{{ "a": "$a" }};
print 'is_utf8: ' . ( utf8::is_utf8( $json ) ? 'yes' : 'no' ) . "\n";

my $data_structure = utf8::is_utf8( $json )
? from_json( $json )
: decode_json( $json );


This fails (on my machine) with the following output:

is_utf8: no
malformed UTF-8 character in JSON string, at character offset 8 (before "\x{8a0}" }") at a.pl line 12.

If that is still a mystery, try this:

use strict;
use warnings;

use HTML::Entities;
use Devel::Peek;

Dump( HTML::Entities::decode( '&nbsp;' ) );

The output (on my machine) is:

SV = PV(0x24f2090) at 0x24f3de8
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x2501620 "\240"\0
CUR = 1
LEN = 16

This string is stored internally as "\240", i.e. "\x{a0}", which is the Latin-1 encoding of the non-breaking space. It does not have the utf8 flag set, so the code above treats it as a UTF-8 encoded stream of bytes and fails.
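To see why the flag alone cannot answer the "characters or bytes?" question, here is a minimal sketch using only built-in utf8:: functions. The variable names are made up for illustration. It shows the same character with the flag off and on, and a genuine byte string that also has the flag off.

```perl
use strict;
use warnings;

# One character, U+00A0, stored in a single byte: utf8 flag off.
my $latin1 = "\xA0";

# The same character after switching the internal storage: utf8 flag on.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

# Two bytes that happen to be the UTF-8 encoding of U+00A0: utf8 flag off.
my $utf8_bytes = "\xC2\xA0";

printf "same string:      %s\n", $latin1 eq $upgraded ? 'yes' : 'no';
printf "flag, latin1:     %s\n", utf8::is_utf8($latin1)     ? 'on' : 'off';
printf "flag, upgraded:   %s\n", utf8::is_utf8($upgraded)   ? 'on' : 'off';
printf "flag, utf8 bytes: %s\n", utf8::is_utf8($utf8_bytes) ? 'on' : 'off';
```

The first two strings compare equal although their flags differ, and the character string $latin1 is indistinguishable by flag from the byte string $utf8_bytes.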

I don't know if we can have is_character easily - but the lack of introspection here is surely painful.

9 comments:

Sid Burn said...

utf8::is_utf8() checks whether a string has the utf8 flag set, but it didn't catch the error in your code.

You need to write:

my $data_structure = utf8::is_utf8( $json )
? decode_json($json)
: from_json($json);

Use from_json() with strings that do not have the utf8 flag set, and decode_json() with utf8 strings. You did it the other way round. You used non-utf8 strings with decode_json() and that fails. Change the line and your code works.

zby said...

If you think so - then try this:

use strict;
use warnings;

use utf8;
use HTML::Entities;
use JSON;

my $a = HTML::Entities::decode( '&euro;' );
my $json = qq{{ "a": "$a" }};
print 'is_utf8: ' . ( utf8::is_utf8( $json ) ? 'yes' : 'no' ) . "\n";

my $data_structure = decode_json( $json );
print "Done\n";

The output is:

is_utf8: yes
Wide character in subroutine entry at a.pl line 12.

and no 'Done' - it fails as well.

Sid Burn said...

The failure comes from the JSON module; it is documented, and I made a mistake myself. decode_json() wants a UTF-8 binary string, that is, a string of UTF-8 encoded bytes without the utf8 flag set. You normally get that when you read from I/O and have not called decode() on the string. decode_json() then decodes it, assuming it is UTF-8; if you pass a string that is already decoded, it will die. The correct way:

use strict;
use warnings;
use utf8;
use HTML::Entities;
use JSON;
use Encode;
use Devel::Peek;

my $a = HTML::Entities::decode( '&euro;' );
my $json = qq{{ "a": "$a" }};
check_utf8($json);

# decode_json needs UTF-8 but not in internal UTF-8 format
my $ds1 = decode_json( encode('UTF-8',$json) );
print "Done1\n";

# if you have internal utf8 form, you need to set utf8(0)
my $ds2 = JSON->new->utf8(0)->decode($json);
print "Done2\n";

# Printing character to an UTF-8 Terminal
print encode("UTF-8", $ds1->{a}), "\n";
print encode("UTF-8", $ds2->{a}), "\n";

sub check_utf8 {
    print "is_utf8: ";
    print utf8::is_utf8($_[0]) ? "yes" : "no";
    print "\n";
}

Sid Burn said...

Here is a fully commented example. But the real reason this is so complicated is that HTML::Entities sometimes returns Unicode and sometimes not, and it doesn't say in which encoding it returns anything.

use strict;
use warnings;
use utf8;
use HTML::Entities;
use JSON;
use Encode;
use Devel::Peek;

# returns Unicode string with internal UTF-8 representation
my $a = HTML::Entities::decode('&euro;');
print '$a: '; check_utf8($a);

# returns ISO-8859-1 string with ISO-8859-1 representation
# HTML::Entities is just crap, well I assume it is an
# ISO-8859-1 string, because it is not documented what it
# returns
my $b = HTML::Entities::decode('&nbsp;');
print '$b: '; check_utf8($b);

# build the json strings
my $json1 = qq{{ "a": "$a" }};
my $json2 = qq{{ "b": "$b" }};
print '$json1: '; check_utf8($json1);
print '$json2: '; check_utf8($json2);

# $json1 is Unicode string so we need to do
my $ds1 = JSON->new->utf8(0)->decode($json1);

# $json2 is ISO-8859-1 with ISO-8859-1 representation
# decode_json is wrong because it needs valid UTF-8
# without the UTF-8 flag. We can for example decode it
# manually and then pass it to JSON. I assume that
# HTML::Entities returns ISO-8859-1, but I have no clue
# what it returns; it isn't documented.
my $ds2 = JSON->new->utf8(0)->decode( decode('ISO-8859-1', $json2) );

# Another possibility is to encode it in internal unicode
# representation and back to binary UTF-8. Then
# decode_json is working
my $ds3 = decode_json(encode('UTF-8',decode('ISO-8859-1',$json2)));

# Expecting an UTF-8 Terminal
print "[", encode("UTF-8", $ds1->{a}), "]\n";
print "[", encode("UTF-8", $ds2->{b}), "]\n";
print "[", encode("UTF-8", $ds3->{b}), "]\n";

sub check_utf8 {
    print "is_unicode: ";
    print utf8::is_utf8($_[0]) ? "yes" : "no";
    print "\n";
}

zby said...

from_json($json)

will work for both

HTML::Entities::decode( '&euro;' )

and

HTML::Entities::decode( '&nbsp;' )

because both are correct character data.
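A minimal sketch of that claim, under a few assumptions: it uses the core JSON::PP module rather than JSON (so it runs without CPAN), writes the two characters as escapes instead of going through HTML::Entities, and calls JSON::PP->new->decode, which leaves the utf8 option off in the same way the comments here describe for from_json. The variable names are illustrative only.

```perl
use strict;
use warnings;
use JSON::PP;

my $nbsp = "\x{a0}";      # fits in one byte, so the utf8 flag is off
my $euro = "\x{20ac}";    # needs the wide representation, so the flag is on

# utf8 option left off: the input is treated as character data.
my $json = JSON::PP->new;

my @decoded;
for my $char ($nbsp, $euro) {
    my $data = $json->decode(qq{{ "a": "$char" }});
    push @decoded, ord $data->{a};
    printf "flag %-3s -> U+%04X\n",
        utf8::is_utf8($char) ? 'on' : 'off', ord $data->{a};
}
```

Both inputs decode to the expected code point, whatever the state of the flag on the way in.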

Sid Burn said...

Only these examples will work. If you have a UTF-8 encoded string and never decode() it before passing it to from_json, you get wrong results. In that case you can either use decode_json() or decode it manually. Here is a new example with comments that explain it.


use strict;
use warnings;
use utf8;
use Encode;
use JSON;
use HTML::Entities;

my $a = HTML::Entities::decode('&euro;');
my $b = HTML::Entities::decode('&nbsp;');
my $c = "Helö";
my $d = encode("UTF-8", $c);

my $ja = qq{{ "a" : "$a" }};
my $jb = qq{{ "b" : "$b" }};
my $jc = qq{{ "c" : "$c" }};
my $jd = qq{{ "d" : "$d" }};

my $da = from_json($ja);
my $db = from_json($jb);
my $dc = from_json($jc);
# no error, but it is wrong; technically it is the same as
# my $dd = JSON->new->utf8(0)->decode($jd);
my $dd = from_json($jd);

# What does utf8(0) mean? It changes what encode/decode
# (from the JSON module) do or expect. With utf8(0), decode
# expects a Unicode string with the utf8 flag set, but we did
# not pass one. It doesn't break because the input can still
# be valid, but that doesn't mean it is correct.

# Using decode_json() would be correct in this case.
# my $dd = decode_json($jd);

print encode("UTF-8", $da->{a}), "\n";
print encode("UTF-8", $db->{b}), "\n";
print encode("UTF-8", $dc->{c}), "\n";
# And now this will be wrong: you now have a
# double-encoded UTF-8 string.
print encode("UTF-8", $dd->{d}), "\n";

zby said...

Yeah - I actually agree that you should do from_json on character (that is decoded) data and decode_json on bytes containing utf8 encoded string. My point is that you cannot decide what you get by using is_utf8.
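One way to act on this is to stop asking the string and have the caller say what it is passing. The sketch below is only an illustration under stated assumptions: json_to_data is a hypothetical helper, not part of any module, and it uses the core JSON::PP module in place of JSON.

```perl
use strict;
use warnings;
use Encode qw(encode);
use JSON::PP;

# Hypothetical helper: the caller declares whether $text holds
# UTF-8 encoded bytes (true) or already decoded characters (false).
sub json_to_data {
    my ($text, $is_bytes) = @_;
    return JSON::PP->new->utf8($is_bytes ? 1 : 0)->decode($text);
}

my $chars = qq{{ "a": "\x{a0}" }};     # character data
my $bytes = encode('UTF-8', $chars);   # the same JSON text as UTF-8 bytes

my $from_chars = json_to_data($chars, 0);
my $from_bytes = json_to_data($bytes, 1);
print $from_chars->{a} eq $from_bytes->{a} ? "same value\n" : "different\n";

# Declaring character data as bytes is caught here only because
# a lone \xA0 is not valid UTF-8; other inputs can be mis-decoded silently.
my $ok = eval { json_to_data($chars, 1); 1 };
print $ok ? "decoded\n" : "failed: $@";
```

The design choice is that the knowledge of "bytes or characters" lives at the point where the data enters the program (a file handle, a socket, a library call), not in the scalar itself.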

Anonymous said...

Sid said:

"But the real reason this is so complicated is that HTML::Entities sometimes returns Unicode and sometimes not, and it doesn't say in which encoding it returns anything."

Amen! I'm not even trying to use JSON and this is miserable. Could someone be so kind as to explain if there is a simple solution for making HTML::Entities not do something bizarre on the non-breaking space character? I had read elsewhere that you should pass it the character encoding so it knows, but I don't see anything in the docs that says you can do that.

zby said...

I don't think HTML::Entities is the only library that does this - because it is correct according to the current specification. But I do agree that life would be easier if we just got rid of the Latin1 encoding.