Issues List for September, 2001 Release

This document contains a list of outstanding issues that need resolution before a final release. The development team is soliciting feedback and suggestions on these issues. Please send your comments to html-tidy@w3.org.

  1. UTF-16 Character Encoding

    Status: Resolved

    Should current UTF-16 support be included in the current release?

    The concerns about inclusion were whether the functionality was correct and complete, and whether the configuration interface was appropriate. The functionality issues raised (surrogate handling, default byte order, byte order detection and byte order mark output) have reportedly been addressed. Based on feedback, the encoding is now selected on the command line as an argument to the --char-encoding, --input-encoding and --output-encoding options.
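
    For readers unfamiliar with the functionality issues listed above, the following is a minimal Python sketch of byte order detection and surrogate handling. It is not Tidy's implementation. The function names are hypothetical, and falling back to big-endian when no byte order mark is present is an assumption made for the example, not a statement about Tidy's default.

```python
import codecs


def detect_utf16_byte_order(data: bytes, default: str = "big"):
    """Return (byte_order, payload) for a UTF-16 byte string.

    A leading byte order mark decides the order and is stripped.
    Without one, the caller-supplied default is used (assumption:
    big-endian unless told otherwise).
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little", data[2:]
    return default, data


def decode_utf16(data: bytes, default: str = "big") -> list:
    """Decode UTF-16 bytes into a list of Unicode code points."""
    order, payload = detect_utf16_byte_order(data, default)
    if len(payload) % 2:
        raise ValueError("truncated UTF-16 input")
    # Split the payload into 16-bit code units in the detected order.
    units = [int.from_bytes(payload[i:i + 2], order)
             for i in range(0, len(payload), 2)]
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_points.append(unit)
            i += 1
    return code_points
```

    Rejecting unpaired surrogates outright is one possible policy; a tool could equally substitute a replacement character and report a warning.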

  2. Asian Character Encodings: ISO-2022, Shift-JIS and Big5

    Status: Open

    Should current Asian character encoding support be included in the current release?

    Currently Tidy does not transcode ISO-2022, Shift-JIS or Big5 encodings into Unicode. There appears to be consensus that, in the long term, it can and should. That said, several slightly different mapping tables exist between Unicode, Shift-JIS and ISO-2022. See http://www.w3.org/TR/japanese-xml for details.

    However, it may turn out to be a good thing that Tidy does not transcode. By keeping the text in the original encoding internally, Tidy avoids any data loss due to encoding translation.
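
    To illustrate how differing mapping tables can lead to data loss, here is a small Python sketch. It relies on two of Python's own codecs, shift_jis and cp932, which are used here only as stand-ins for "two slightly different tables"; neither is claimed to be the table Tidy would adopt. The helper names are hypothetical.

```python
def decode_with(data: bytes, codec_names=("shift_jis", "cp932")) -> dict:
    """Decode the same bytes under several mapping tables."""
    return {name: data.decode(name) for name in codec_names}


def survives_round_trip(data: bytes, decode_codec: str,
                        encode_codec: str) -> bool:
    """True if decoding and re-encoding reproduces the original bytes."""
    try:
        return data.decode(decode_codec).encode(encode_codec) == data
    except UnicodeError:
        # The intermediate Unicode text has no mapping back.
        return False


# The two-byte sequence 0x81 0x60 is a well-known point of disagreement
# between Shift-JIS mapping tables.
WAVE_DASH_BYTES = b"\x81\x60"
```

    With these two codecs the same bytes decode to different Unicode characters, so text decoded with one table and written back with the other does not round-trip. Keeping the original bytes untouched sidesteps the problem entirely.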

    Another issue is that sample documents are needed for complete testing. If you have HTML documents in any of these encodings, and can share them with us, it would be much appreciated.

  3. Escaping <script> and <style> in XHTML

    Status: Open

    There are a number of open issues surrounding the escaping of <script> and <style> content when producing XHTML output. For those just tuning in, the basic issue is that browser scripts often contain character sequences that are special in XML: '&', '<', ']]>' and '<' + '/' + Letter.

    If these are escaped to make XML processors happy, the script breaks. The agreed solution is to place <script> source within a CDATA section. This is now done for both <script> and <style> tags. So far, so good. But there are a number of open issues and possible unintended consequences.

    A primary source of the side effects is that the CDATA begin and end markers must be commented out in the source of the script or stylesheet. In addition, script source is often embedded in HTML comments to prevent parsing by older browsers that do not support Javascript. Although these browsers are exceedingly rare in the field these days, a large body of HTML pages is coded using this technique. Finally, embedding script source in HTML comments is sometimes used, with IE only, as a way of hiding instances of '<' + '/' + Letter from the browser.
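
    The commented-out CDATA markers described above can be sketched as follows. This is an illustration in Python, not Tidy's code or its exact output format. The function names and the comment-syntax table are assumptions made for the example.

```python
# Line-comment prefix per scripting language (assumed for this sketch).
LINE_COMMENT = {
    "javascript": "//",
    "jscript": "//",
    "ecmascript": "//",
    "vbscript": "'",
}


def wrap_script(source: str, language: str = "javascript") -> str:
    """Wrap script source in a CDATA section whose markers are hidden
    from the script interpreter behind line comments."""
    if "]]>" in source:
        # The section would end early; this needs separate handling.
        raise ValueError("source contains ']]>'")
    prefix = LINE_COMMENT[language.lower()]
    return f"{prefix}<![CDATA[\n{source.rstrip()}\n{prefix}]]>"


def wrap_style(css: str) -> str:
    """Same idea for CSS, which only has block comments."""
    if "]]>" in css:
        raise ValueError("source contains ']]>'")
    return f"/*<![CDATA[*/\n{css.rstrip()}\n/*]]>*/"
```

    Note that every supported language needs its own comment syntax, which is part of why the number of supported scripting languages matters.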

    1. Is CDATA escaping necessary for <style> tags?

      Currently, only CSS stylesheets are supported. It has been noted that the offending special characters rarely appear, if ever, in CSS source.

    2. How should '<' + '/' + Letter appearing in script source be handled?

      Currently, no attempt is made to detect such constructs or to hide them. In SGML this sequence marks the end of CDATA element content, so a script containing it is invalid HTML. One option is to hide the script source within HTML comments (within script comments, within the CDATA section). Another is to emit an error if such a sequence is detected. The problem with emitting an error is that user intervention is required. The problem with HTML comments is that the technique only works for IE and is complicated to implement correctly for multiple scripting languages, which leads us to ...
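
      For reference, the detection half of the "emit an error" option is simple to sketch. The following Python fragment is an illustration only; the function name is hypothetical, and treating only ASCII letters as "Letter" is an assumption.

```python
import re

# '<' + '/' + Letter, with Letter taken to be an ASCII letter.
END_TAG_OPEN = re.compile(r"</[A-Za-z]")


def find_end_tag_openers(script_source: str) -> list:
    """Return the offsets of every '</' + Letter in the script source."""
    return [m.start() for m in END_TAG_OPEN.finditer(script_source)]
```

      A caller could report each offset as an error, or use the offsets as the trigger for whatever hiding strategy is chosen.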

    3. What scripting languages should be supported?

      It has been suggested that CDATA escaping support for all scripting languages besides Javascript be dropped to make the hiding logic described above a tractable proposition. Currently, Javascript, JScript, ecmascript and VBScript are supported. Other candidates include perlscript and ruby.

    4. What XML processors should be supported?

      The whole purpose of CDATA escaping is for XML compatibility. It is not necessary for scripts to work with most current browsers.

      The most popular general purpose XML tool that might be applied to XHTML pages is probably XSLT. XSLT can treat certain element types as CDATA sections. But all instances of a type, <script> say, will be treated the same: CDATA or escaped. So there is no benefit to embedding some scripts in CDATA sections but not others.

      But users may also write DOM and/or SAX based applications, both of which may optionally preserve CDATA sections. By default, however, SAX will not preserve CDATA sections.
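
      The SAX behaviour can be demonstrated with Python's standard xml.sax module, used here purely as one example of a SAX implementation. The class and function names are hypothetical. Without a lexical handler the application sees only character data; CDATA boundaries are reported only when one is registered.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class EventLog(ContentHandler):
    """Records character data and, when also registered as the lexical
    handler, the start and end of CDATA sections."""

    def __init__(self):
        super().__init__()
        self.events = []

    # ContentHandler callback: always invoked for text.
    def characters(self, content):
        self.events.append(("chars", content))

    # Lexical handler callbacks: invoked only if registered.
    def startCDATA(self):
        self.events.append(("startCDATA",))

    def endCDATA(self):
        self.events.append(("endCDATA",))

    def comment(self, content):
        pass

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass


def sax_events(xml_text: str, preserve_cdata: bool = False) -> list:
    """Parse xml_text and return the recorded events."""
    log = EventLog()
    parser = xml.sax.make_parser()
    parser.setContentHandler(log)
    if preserve_cdata:
        parser.setProperty(property_lexical_handler, log)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return log.events
```

      Calling sax_events on a document such as "<script><![CDATA[if (a < b) go();]]></script>" yields only character events by default, so an application that rewrites the document from those events would emit escaped text rather than a CDATA section.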

      The practical question is whether all <script> tags should be CDATA escaped or only those that absolutely require it. The impact falls primarily on the code maintainer, who must either read through the messier code or be sure to re-Tidy the file and/or add CDATA sections by hand.