UTF-8 JavaScript files [JavaScript]


From: Hans-Georg Michna on 3 Jul 2010 15:03

I'm having a problem with a UTF-8 HTML page containing a
<script> tag that calls in a JavaScript file that is also
encoded in UTF-8.

The JavaScript program, among other things, contains a string
literal, which contains an umlaut, and dynamically puts the
string into an HTML tag. But the umlaut is not displayed
properly and displays as a little square box instead. What could
be the cause of this problem?

Am I right in assuming that a JavaScript file inserted by means
of the <script> tag is interpreted as being encoded in the same
character set as the HTML page itself? If so, then I have to
search for the error elsewhere.

I haven't gotten to any more thorough analysis yet. Thought I
should ask here first, just in case there are a few well-known
potential causes.

Hans-Georg

From: Richard Cornford on 3 Jul 2010 15:36

Hans-Georg Michna wrote:
> I'm having a problem with a UTF-8 HTML page containing a
> <script> tag that calls in a JavaScript file that is also
> encoded in UTF-8.
>
> The JavaScript program, among other things, contains a
> string literal, which contains an umlaut, and dynamically
> puts the string into an HTML tag. But the umlaut is not
> displayed properly and displays as a little square box
> instead. What could be the cause of this problem?
>
> Am I right in assuming that a JavaScript file inserted by
> means of the <script> tag is interpreted as being encoded
> in the same character set as the HTML page itself?

Without a reference to an HTML spec saying as much, that would be an
assumption, although not an unreasonable one, as it would be a sensible
strategy. I would, however, expect the above description to state that you
have examined the HTTP traffic (using an HTTP monitor/proxy such as
Fiddler or Charles) and verified, first, that the JavaScript is being
served with appropriate Content-Type headers (either asserting UTF-8, or
at least not contradicting it), and second, that the bytes actually
sent include the correct byte sequence for the UTF-8 encoding of
the offending character (by looking at the hex representation of the
resource in the HTTP monitor).
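
As a rough illustration of the byte-level check described above, the
sketch below computes the UTF-8 bytes that should show up in the hex view
for 'ü', and what the two possible mismatches would look like. The helper
names (utf8Hex, decodeLatin1) are made up for this example, and it assumes
an environment that provides TextEncoder and TextDecoder (current browsers,
Node.js). The umlaut is written as an escape so the example does not depend
on how the example file itself is saved.

```javascript
// Hypothetical helper: UTF-8 bytes of a string as space-separated hex,
// for comparison with the hex view of an HTTP monitor.
function utf8Hex(text) {
  var bytes = new TextEncoder().encode(text);
  return Array.from(bytes, function (b) {
    return b.toString(16).padStart(2, '0');
  }).join(' ');
}

// Hypothetical helper: decode bytes as ISO-8859-1, where every byte maps
// to the code point with the same numeric value.
function decodeLatin1(bytes) {
  return Array.from(bytes, function (b) {
    return String.fromCharCode(b);
  }).join('');
}

var umlaut = '\u00FC'; // ü

// What a correctly saved UTF-8 file contains for ü.
var expectedHex = utf8Hex(umlaut); // "c3 bc"

// UTF-8 bytes misread as ISO-8859-1: two characters instead of one.
var utf8ReadAsLatin1 = decodeLatin1(new TextEncoder().encode(umlaut)); // "\u00C3\u00BC"

// ISO-8859-1 byte (0xFC) misread as UTF-8: invalid, so U+FFFD is substituted.
var latin1ReadAsUtf8 = new TextDecoder('utf-8').decode(new Uint8Array([0xFC])); // "\uFFFD"
```

If the monitor shows fc rather than c3 bc for the umlaut, the file was
probably not saved as UTF-8 in the first place. A replacement character or
box would be consistent with that, though it is only one possible cause.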

> If so, then I have to search for the error elsewhere.
>
> I haven't gotten to any more thorough analysis yet. Thought I
> should ask here first, just in case there are a few well-known
> potential causes.

If nothing else, trying the SCRIPT element with an explicit CHARSET
attribute (asserting UTF-8) might prove instructive.

Richard.

From: johncoltrane on 3 Jul 2010 15:59

On 03/07/10 21:03, Hans-Georg Michna wrote:
> I'm having a problem with a UTF-8 HTML page containing a
> <script> tag that calls in a JavaScript file that is also
> encoded in UTF-8.
>
> The JavaScript program, among other things, contains a string
> literal, which contains an umlaut, and dynamically puts the
> string into an HTML tag. But the umlaut is not displayed
> properly and displays as a little square box instead. What could
> be the cause of this problem?
>
> Am I right in assuming that a JavaScript file inserted by means
> of the <script> tag is interpreted as being encoded in the same
> character set as the HTML page itself? If so, then I have to
> search for the error elsewhere.
>
> I haven't gotten to any more thorough analysis yet. Thought I
> should ask here first, just in case there are a few well-known
> potential causes.
>
> Hans-Georg

AFAIK JavaScript is supposed to be UTF-8 compatible. You can even use
Japanese hiragana as variable names.

I just ran a few quick tests in Firefox with the factory default charset
(iso-8859-1).

relevant HTML:

<script src="js.js" type="text/javascript"></script> (no charset)
or
<script src="js.js" type="text/javascript" charset="utf-8"></script>

and

<body onload="init();">
<p id="txt"></p>
</body>

relevant JS:

function init()
{
  // Declared with var so the test does not create an implicit global.
  var ぢ = "✍xvbc;,wxjhgdkqsj¬ﬁÌÏﬁƒ¬Ò÷ß∂ƒÒÈ∂ºÒÌƒßÒ÷È∂ƒßÈº∂Ì≠¬ÏîÂÏ";
  document.getElementById('txt').innerHTML = ぢ;
}

page charset | script charset | var ぢ = 'ü' | var txt = 'ü'
-------------+----------------+--------------+---------------
none         | none           | parse error  | garbled glyphs
none         | utf-8          | works        | works
utf-8        | none           | works        | works
utf-8        | utf-8          | works        | works
iso-8859-1   | none           | parse error  | garbled glyphs
iso-8859-1   | utf-8          | works        | works
Soooo... I'm not sure why you would get a garbled glyph if at least the
HTML document is in utf-8.

--
(ôlô)

From: Thomas 'PointedEars' Lahn on 3 Jul 2010 17:05

johncoltrane wrote:

> AFAIK JavaScript is supposed to be UTF-8 compatible.

You know nonsense; partially because you don't know what JavaScript is,
partially because you don't know what UTF-8 is.

,-[ECMAScript Language Specification, Edition 5 Final Draft]
|
| A conforming implementation of this International standard shall interpret
| characters in conformance with the Unicode Standard, Version 3.0 or later
| and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding
| form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not
| otherwise specified, it is presumed to be the BMP subset, collection 300.
| If the adopted encoding form is not otherwise specified, it presumed to be
| the UTF-16 encoding form.

The key phrase here being "If the adopted encoding form is not otherwise
specified". See below.

> You can even use Japanese hiragana as variable names.

That is a subset of a character set (Unicode), not an encoding (UTF-8).
Learn to understand the difference.
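
A small sketch of that distinction, for illustration only: the same
character (one code point in the Unicode character set) comes out as
different byte sequences depending on the encoding, while the script
engine itself sees one UTF-16 code unit. The variable and helper names are
invented for this example, and TextEncoder is assumed to be available.

```javascript
var ch = '\u3062'; // HIRAGANA LETTER DI, the variable name used in the test above

// Hypothetical helper: format a byte as two hex digits.
function hexByte(n) {
  return n.toString(16).padStart(2, '0');
}

// Character set view: a single Unicode code point.
var codePoint = ch.codePointAt(0).toString(16); // "3062"

// Encoding view 1: UTF-8 needs three bytes for this code point.
var utf8Bytes = Array.from(new TextEncoder().encode(ch), hexByte).join(' '); // "e3 81 a2"

// Encoding view 2: UTF-16 (big-endian byte order) needs two bytes.
var unit = ch.charCodeAt(0);
var utf16beBytes = [unit >> 8, unit & 0xff].map(hexByte).join(' '); // "30 62"

// Inside the engine the string is one UTF-16 code unit long,
// regardless of how the source file was encoded on disk.
var lengthInCodeUnits = ch.length; // 1
```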

> I just ran a few quick tests in Firefox with the factory default charset

Nonsense. Obviously you don't know what "charset" means to begin with.

> (iso-8859-1).

That is a character encoding, and its being the *HTTP default* in reality
is heavily overrated. And there is *no* default value for the `charset'
attribute specified in HTML.

> relevant HTML:
>
> <script src="js.js" type="text/javascript"></script> (no charset)
> or
> <script src="js.js" type="text/javascript" charset="utf-8"></script>

As specified, HTTP header information and “A META declaration with "http-
equiv" set to "Content-Type" and a value set for "charset"” take precedence
over this attribute and related attributes.

<http://www.w3.org/TR/REC-html40/charset.html>

> Soooo... I'm not sure why you would get a garbled glyph if at least the
> HTML document is in utf-8.

Because one has nothing to do with the other. It is the declaration of the
encoding of the resources in the HTTP Content-Type header (no, _not_ meta)
that matters most; everything else only matters if it is *missing*. And
there are still some stupid server administrators that have `Content-Type:
...; charset=ISO-8859-1' sent by default (a default configuration bug that
was fixed for Apache years ago¹).
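
The precedence described here can be modelled roughly as follows. This is
a simplified sketch, not how any particular browser implements it: the
function names and the fallback parameter are assumptions made for the
example, and the META step is left out because a script file has no META
element of its own.

```javascript
// Hypothetical helper: extract the charset parameter from a
// Content-Type header value, or return null if there is none.
function charsetFromContentType(headerValue) {
  if (!headerValue) {
    return null;
  }
  var match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(headerValue);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical model of the precedence: the HTTP header wins; the charset
// attribute of the SCRIPT element only counts if the header is silent;
// the fallback (an assumption here) is used when both are missing.
function effectiveScriptEncoding(httpContentType, charsetAttribute, fallback) {
  return charsetFromContentType(httpContentType) ||
         (charsetAttribute ? charsetAttribute.toLowerCase() : null) ||
         fallback;
}

// A server default of ISO-8859-1 overrides charset="utf-8" on the element.
var overridden = effectiveScriptEncoding(
  'text/javascript; charset=ISO-8859-1', 'utf-8', 'utf-8'); // "iso-8859-1"

// With no charset in the header, the attribute decides.
var fromAttribute = effectiveScriptEncoding(
  'text/javascript', 'UTF-8', 'iso-8859-1'); // "utf-8"
```

Under this model, a UTF-8 file served with a conflicting charset in the
header would be misread no matter what the page or the SCRIPT element
declares.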

Learn to quote.

PointedEars
___________
¹ <https://issues.apache.org/bugzilla/show_bug.cgi?id=23421>
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$8300dec7@news.demon.co.uk> (2004)

From: johncoltrane on 3 Jul 2010 18:18

>> AFAIK JavaScript is supposed to be UTF-8 compatible.
>
> You know nonsense; partially because you don't know what JavaScript is,
> partially because you don't know what UTF-8 is.
>
> ,-[ECMAScript Language Specification, Edition 5 Final Draft]
> |
> | A conforming implementation of this International standard shall interpret
> | characters in conformance with the Unicode Standard, Version 3.0 or later
> | and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding
> | form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not
> | otherwise specified, it is presumed to be the BMP subset, collection 300.
> | If the adopted encoding form is not otherwise specified, it presumed to be
> | the UTF-16 encoding form.
>
> The key phrase here being "If the adopted encoding form is not otherwise
> specified". See below.
>
>> You can even use Japanese hiragana as variable names.
>
> That is a subset of a character set (Unicode), not an encoding (UTF-8).
> Learn to understand the difference.

I know the difference. It was an example: variable names in non-ASCII
characters do work in... that mostly browser-centric scripting language.

Think of it as a preemptive illustration of your rebuttal.

>> I just ran a few quick tests in Firefox with the factory default charset
>
> Nonsense. Obviously you don't know what "charset" means to begin with.
>
>> (iso-8859-1).
>
> That is a character encoding, and its being the *HTTP default* in reality
> is heavily overrated. And there is *no* default value for the `charset'
> attribute specified in HTML.

Well, what I know is that when talking about HTML, the difference
between "character set" and "encoding" is practically non-existent, both
words being used (wrongly, I give you that) interchangeably. Also I was
referring to the default settings of Firefox, here.

HTML has only the "charset" attribute and it's not supposed to accept
"Unicode" or "Hiragana" or "Occidental" as value. We are left with
"utf-8" (the most widely used way of representing the full/most of the
Unicode standard, including Hiragana) or "iso-8859-1" or a slew of other
possibilities.

Hell, in XML/XHTML we even have to use both terms.

> As specified, HTTP header information and “A META declaration with "http-
> equiv" set to "Content-Type" and a value set for "charset"” take precedence
> over this attribute and related attributes.
>
> <http://www.w3.org/TR/REC-html40/charset.html>

I thought it was possible to override the HTTP header with in-document
declarations. Thanks.

>> Soooo... I'm not sure why you would get a garbled glyph if at least the
>> HTML document is in utf-8.
>
> Because one has nothing to do with the other. It is the declaration of the
> encoding of the resources in the HTTP Content-Type header (no, _not_ meta)
> that matters most; everything else only matters if it is *missing*. And
> there are still some stupid server administrators that have `Content-Type:
> ...; charset=ISO-8859-1' sent by default (a default configuration bug that
> was fixed for Apache years ago¹).

Yes. I can't even remember ever seeing this default. But I'm not an
old-timer.

That said "HTML document is in utf-8" was too unspecific. I was thinking
about the HTTP header. Sorry.

> Learn to quote.

Like that?
--
(ôlô)
