This document contains a list of outstanding issues that need resolution before a final release. The development team is soliciting feedback and suggestions on these issues. Please send your comments to html-tidy@w3.org.
Status: Resolved
Should current UTF16 support be included in the current release?
The concerns about inclusion were whether the functionality was correct and complete, and
whether the configuration interface was appropriate. Apparently, functionality issues such as surrogate handling,
default byte order, byte order detection and byte order mark output have been addressed.
Also, following feedback, UTF16 is now selected on the command line as an argument to the
--char-encoding, --input-encoding and
--output-encoding options.
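For readers unfamiliar with the functionality issues listed above, the following is a minimal Python sketch of byte order detection, a default byte order, surrogate decoding and byte order mark output. It is not Tidy's code: the function names are hypothetical, and the big-endian default is an assumption made for the example only.

```python
import codecs
from typing import Tuple

def detect_utf16_byte_order(data: bytes, default: str = "big") -> Tuple[str, int]:
    """Return (byte_order, bom_length) for a UTF-16 byte stream.

    The default is used only when no byte order mark is present.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big", 2
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little", 2
    return default, 0

def decode_utf16(data: bytes, default: str = "big") -> str:
    """Decode UTF-16 input, skipping the byte order mark if there is one."""
    order, skip = detect_utf16_byte_order(data, default)
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    # The codec combines surrogate pairs into a single code point.
    return data[skip:].decode(codec)

def encode_utf16_with_bom(text: str, order: str = "big") -> bytes:
    """Encode text as UTF-16 and write a byte order mark first."""
    if order == "big":
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

For example, the big-endian surrogate pair D83D DE00 preceded by the mark FE FF decodes to the single character U+1F600.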
Status: Open
Should current Asian character encoding support be included in the current release?
Currently Tidy does not transcode ISO-2022, Shift-JIS or Big5 encodings into Unicode. There appears to be consensus that, long term, it can and should. That said, there are several slightly different mapping tables between Unicode, Shift-JIS and ISO-2022. See http://www.w3.org/TR/japanese-xml for details.
However, it may turn out to be a good thing that Tidy does not transcode. By keeping the text in the original encoding internally, Tidy avoids any data loss due to encoding translation.
Another issue is that sample documents are needed for complete testing. If you have HTML documents in any of these encodings, and can share them with us, it would be much appreciated.
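The mapping table concern can be illustrated with a short Python sketch. It uses Python's own "shift_jis" and "cp932" codecs purely as two examples of slightly different Shift-JIS tables; the helper names are hypothetical, and nothing here reflects how Tidy would transcode.

```python
# The double-byte sequence 0x81 0x60 is a position where Shift-JIS
# mapping tables are known to disagree.
SAMPLE_BYTES = b"\x81\x60"

def decode_with(codec_name: str, data: bytes = SAMPLE_BYTES) -> str:
    """Decode the sample bytes using the named codec's mapping table."""
    return data.decode(codec_name)

def round_trips(text: str, codec_name: str) -> bool:
    """Report whether text survives encoding and decoding unchanged."""
    try:
        return text.encode(codec_name).decode(codec_name) == text
    except UnicodeError:
        return False

jis_variant = decode_with("shift_jis")
ms_variant = decode_with("cp932")

# Same bytes, two different Unicode characters.
tables_disagree = jis_variant != ms_variant
```

A document decoded with one table and written back with the other may not come out byte-for-byte identical, which is the kind of data loss that keeping the original encoding avoids.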
Status: Open
There are a number of problems / open issues surrounding
the escaping of <script> and <style> tags when
producing XHTML output. For those just tuning in, the basic
issue is that browser scripts will often contain special XML
characters: '&', '<',
']]>' and '<' + '/' + Letter.
If these are escaped to make XML processors happy, it will break the script. The agreed solution is to place <script> source within a CDATA section. This is now done for both <script> and <style> tags. So far, so good. But there are a number of open issues and possible unintended consequences.
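To see why escaping breaks a script, consider this small Python sketch. The sample script and the function name are made up for illustration; the standard library's html.escape stands in for whatever escaping an XML serializer would apply.

```python
from html import escape, unescape

# A made-up script containing '<', '&' and '<' + '/' + Letter.
SCRIPT = "if (a < b && c) { document.write('<b>hi</b>'); }"

def xml_escape_script(source: str) -> str:
    """Escape '&', '<' and '>' the way element text would be escaped."""
    return escape(source, quote=False)

escaped = xml_escape_script(SCRIPT)
```

An XML parser turns the entity references back into the original characters, but a browser that treats <script> content as raw text hands '&lt;' and '&amp;' to the script engine literally, and the script no longer parses.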
A primary source of the side-effects is that the CDATA begin and end
markers must be commented out in the source of the script or stylesheet.
In addition, script source is often embedded in HTML comments to prevent
parsing by older browsers that do not support Javascript. Although these
browsers are exceedingly rare in the field these days, a large body of
HTML pages is coded using this technique. Finally, embedding script
source in HTML comments is sometimes used - with IE only - as a way of
hiding instances of '<' + '/' + Letter from the browser.
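The commented-out markers described above might look like the output of the following Python sketch. The comment table and function names are hypothetical, and the exact text Tidy emits may differ.

```python
# Hypothetical table: the line-comment prefix for each script language.
LINE_COMMENT = {
    "javascript": "//",
    "jscript": "//",
    "ecmascript": "//",
    "vbscript": "'",
}

def wrap_script_in_cdata(source: str, language: str = "javascript") -> str:
    """Wrap script source in a CDATA section whose begin and end markers
    are hidden from the script engine behind line comments."""
    marker = LINE_COMMENT[language.lower()]
    return f"{marker}<![CDATA[\n{source}\n{marker}]]>"

def wrap_style_in_cdata(css: str) -> str:
    """Wrap CSS source in a CDATA section, hiding the markers inside
    CSS block comments."""
    return f"/*<![CDATA[*/\n{css}\n/*]]>*/"
```

An XML parser sees an ordinary CDATA section, while the script engine or CSS parser sees two extra comments. Each additional language needs its own comment syntax, which is where much of the complexity comes from.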
Currently, only CSS stylesheets are supported. It has been noted that the offending special characters rarely appear, if ever, in CSS source.
Currently, detection is not attempted, nor is any attempt made to hide such constructs. This sequence marks the end of a CDATA section in SGML and is thus invalid HTML. One option is to hide the script source within HTML comments (within script comments, within the CDATA section). Another is to emit an error if such a sequence is detected. The problem with emitting an error is that user intervention is required. The problem with HTML comments is that they only work for IE and are complicated to implement correctly for multiple scripting languages, which leads us to ...
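If detection were added, a first cut could be as simple as the Python sketch below. It assumes the sequences of interest are ']]>' and '<' + '/' + Letter; the names are hypothetical, and a real implementation would also have to decide what to do with each hit (hide it or report an error).

```python
import re
from typing import List, Tuple

CDATA_END = "]]>"
# '<' + '/' + Letter, which ends element content in SGML.
END_TAG_OPEN = re.compile(r"</[A-Za-z]")

def find_problem_sequences(source: str) -> List[Tuple[int, str]]:
    """Return (offset, sequence) pairs for every problem sequence found,
    ordered by position in the source."""
    problems = []
    pos = source.find(CDATA_END)
    while pos != -1:
        problems.append((pos, CDATA_END))
        pos = source.find(CDATA_END, pos + 1)
    for match in END_TAG_OPEN.finditer(source):
        problems.append((match.start(), match.group()))
    return sorted(problems)
```

Note that this plain text scan knows nothing about string literals or comments in the scripting language, so it cannot tell a harmless occurrence from a harmful one.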
It has been suggested that CDATA escaping support for all scripting languages besides Javascript be dropped to make the hiding logic described above a tractable proposition. Currently, Javascript, JScript, ecmascript and VBScript are supported. Other candidates include perlscript and ruby.
The whole purpose of CDATA escaping is for XML compatibility. It is not necessary for scripts to work with most current browsers.
The most popular general purpose XML tool that might be applied to XHTML pages is probably XSLT. XSLT can treat certain element types as CDATA sections. But all instances of a type, <script> say, will be treated the same: CDATA or escaped. So there is no benefit to embedding some scripts in CDATA sections but not others.
But users may also write DOM and/or SAX based applications, both of which may, optionally, preserve CDATA sections. By default, however, SAX will not preserve CDATA sections.
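The difference is easy to observe with Python's standard library, used here only as one example of a DOM and a SAX implementation; other toolkits may behave differently. The sample document and helper names are made up for the illustration.

```python
import xml.sax
from xml.dom import Node, minidom

DOC = "<script><![CDATA[if (a < b && c) { go(); }]]></script>"

class TextCollector(xml.sax.ContentHandler):
    """A plain SAX content handler: it only ever sees character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

def sax_text(doc: str) -> str:
    """Return the text reported by SAX; the CDATA boundaries are gone."""
    handler = TextCollector()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return "".join(handler.chunks)

def dom_child_types(doc: str) -> list:
    """Return the node types of the root element's children in the DOM."""
    root = minidom.parseString(doc).documentElement
    return [child.nodeType for child in root.childNodes]

# minidom keeps the section as a distinct CDATA node.
dom_keeps_cdata = dom_child_types(DOC) == [Node.CDATA_SECTION_NODE]
```

A SAX application that writes its events back out would therefore emit escaped text unless it registers an additional handler for CDATA boundaries, while this DOM serializes the section as it was read.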
The practical question is whether all <script> tags should be CDATA escaped or only those that absolutely require it. The impact is primarily on the code maintainer, who must either read through the messier code or be sure to re-Tidy the file and/or add CDATA sections by hand.
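The two policies can be compared in a short Python sketch. The escape_all flag is a hypothetical switch invented for this example, not an existing Tidy option, and only Javascript-style comment markers are shown.

```python
def needs_cdata(source: str) -> bool:
    """A script only requires CDATA escaping if it contains characters
    that XML would otherwise force us to escape."""
    return "&" in source or "<" in source

def emit_script_body(source: str, escape_all: bool = True) -> str:
    """Return the script body as it would appear in the XHTML output.

    escape_all=True wraps every script; escape_all=False wraps only
    the scripts that require it and leaves the rest untouched.
    """
    if escape_all or needs_cdata(source):
        return f"//<![CDATA[\n{source}\n//]]>"
    return source
```

With escape_all=False a script such as "x = 1;" stays as written, which keeps the page easier to read. The cost is that a later hand edit that introduces '<' or '&' makes the page ill-formed until it is run through Tidy again.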