What's all this about TidyLib?

TidyLib, like it sounds, is a library version of Dave Raggett's popular HTML Tidy. In fact, one of the motivations for starting the Source Forge project was to refactor HTML Tidy as a callable library. Although the command line tool is great, it is difficult and inefficient to integrate into other software.

Requirements

We had several informal requirements for the library:

You Can Get There From Here

Probably the most important requirement is that the library be easy to integrate. Because of the almost universal adoption of C linkage, a C interface may be called from a great many programming languages. This, and the fact that code was already in C and the team was already most comfortable with C, led to the decision that the library's public interface should be kept in C.

The other major design decision was to use opaque types in the public interface. This allows the application to just pass an integer around, which minimizes the need to transform data types between languages.
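To make the opaque-type idea concrete, here is a minimal sketch of the general pattern in C. It is not TidyLib's actual declaration; every name in it (DemoDoc, demoCreate, demoSetOption, demoGetOption, demoRelease) is hypothetical and chosen only for illustration. The point is that callers only ever hold a handle, so the private structure can change without touching client code or language wrappers.

```c
#include <assert.h>
#include <stdlib.h>

/* --- Public interface: callers see only an opaque handle. ---
   All names here are hypothetical, not part of TidyLib. */
typedef struct DemoDocImpl* DemoDoc;

DemoDoc demoCreate(void);
int     demoSetOption(DemoDoc doc, int value);
int     demoGetOption(DemoDoc doc);
void    demoRelease(DemoDoc doc);

/* --- Private implementation: the layout is free to change. --- */
struct DemoDocImpl {
    int option;
};

DemoDoc demoCreate(void)
{
    /* calloc zeroes the state, so the option starts at 0 */
    return calloc(1, sizeof(struct DemoDocImpl));
}

int demoSetOption(DemoDoc doc, int value)
{
    assert(doc != NULL);
    doc->option = value;
    return 0;                       /* 0 == success */
}

int demoGetOption(DemoDoc doc)
{
    assert(doc != NULL);
    return doc->option;
}

void demoRelease(DemoDoc doc)
{
    free(doc);                      /* free(NULL) is a no-op */
}
```

In a real library the private structure would live in a separate source file, so that only the handle typedef and the function prototypes appear in the public header.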

This strategy has already paid off. It was straight-forward to write very thin library wrappers for C++, Pascal, and COM/ATL. It was also quick to generate a Perl wrapper using SWIG. SWIG wrappers for Python, Ruby, Java and others should also be possible.

Don't Break Anything

Of course, Tidy must remain Tidy. It wasn't acceptable to introduce bugs or drop (many) features. In the end, the body of test documents proved invaluable to getting things working.

Thread Safe / Reentrant

Because there are many uses for HTML Tidy, from content validation and content scraping to conversion to XHTML, it was important to make TidyLib run reasonably well within server applications as well as on the client side.

This requirement implies that the library be fully re-entrant so that it may be used within multi-threaded applications.

Adaptable I/O

As part of the larger integration strategy, it was decided to fully abstract all I/O. This means a (relatively) clean separation between character encoding processing and shovelling bytes back and forth. Internally, the library reads from "sources" and writes to "sinks". This abstraction is used for both markup and configuration "files". Concrete implementations are provided for file and memory I/O. But new sources and sinks may be provided via the public interface.
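The source/sink idea can be sketched in a few lines of C. This is an illustration of the abstraction only, under the assumption that a source or sink is a context pointer plus callbacks; the types and functions below (ByteSource, ByteSink, MemReader, MemWriter, pumpBytes) are hypothetical and are not the declarations found in tidy.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte source: a context pointer plus a callback. */
typedef struct {
    void* data;                                   /* implementation state */
    int  (*getByte)(void* data);                  /* next byte, or -1 at end */
} ByteSource;

/* Hypothetical byte sink: receives one byte at a time. */
typedef struct {
    void* data;
    void (*putByte)(void* data, unsigned char byte);
} ByteSink;

/* Concrete in-memory implementations. */
typedef struct { const unsigned char* bytes; size_t size, pos; } MemReader;
typedef struct { unsigned char* bytes; size_t capacity, used; } MemWriter;

static int memGetByte(void* data)
{
    MemReader* r = data;
    return r->pos < r->size ? r->bytes[r->pos++] : -1;
}

static void memPutByte(void* data, unsigned char byte)
{
    MemWriter* w = data;
    if (w->used < w->capacity)      /* bytes that do not fit are dropped */
        w->bytes[w->used++] = byte;
}

/* The "library" side: it has no idea where bytes come from or go to. */
size_t pumpBytes(ByteSource* in, ByteSink* out)
{
    size_t count = 0;
    int c;
    assert(in != NULL && out != NULL);
    while ((c = in->getByte(in->data)) >= 0) {
        out->putByte(out->data, (unsigned char)c);
        ++count;
    }
    return count;                   /* number of bytes read from the source */
}
```

A file-backed or network-backed source would supply a different getByte callback and context, while pumpBytes stays unchanged.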

We had some prior art to follow as well. Most notably, Marc-Andre Lemburg's mxTidy. In the process of writing a Python wrapper for Tidy, Marc-Andre applied these principles and built a C library. TidyLib can be seen as a completion of Marc's work.

Getting Started

Get The Source

The best way to get the lib sources is directly from CVS. If you have CVS installed (recommended!), just execute the following commands:

C:\src> mkdir tidylib
C:\src> cd tidylib
C:\src\tidylib> set TIDYCVSROOT=:pserver:anonymous@cvs.sourceforge.net:/cvsroot/tidy
C:\src\tidylib> cvs -d %TIDYCVSROOT% login
C:\src\tidylib> cvs -d %TIDYCVSROOT% export -d C:\src\tidylib -r HEAD _
 build console htmldoc include src test

When CVS prompts you for the password, just hit ENTER. The underscore (_) above denotes line continuation. Do not type it in, just use one long command line. The procedure is similar for Unix variants. Just translate to the appropriate path separator for your file system and do not use the -d <dir> option. Copy and paste the above into a script or batch file. For the truly lazy, you can pull a gzipped source tarball from the Tidy Project Page.

Build It

For an overview of build options, see build/readme.txt. It describes the overall layout and more info on supported build systems.

Unix / GNU

For GNU gcc, just run gmake with build/gmake/Makefile. The usual target is all. If you want a debug build, use the debug target. For other Unix compilers, you may have to set the CC macro to point to your compiler, usually just cc. The same large number of Unix systems is supported "out of the box" as with Tidy Classic. Tidy usually does a good job of automatically identifying the current platform. If not, tweak platform.h as needed and send us a patch!

If you are using GCC/MinGW, you should use gmake as well.

In addition, there are targets for clean and install. Be sure to look at the Makefile before using install to make sure the binaries, headers and library go where you want. By default, these are /usr/bin, /usr/include, and /usr/lib, respectively. There are macros in the Makefile to customize your installation.

make all

Windows / Visual C++

For VC++, you can use either msvc/Makefile.vc6 on the command line or build/msvc/tidy.dsw in the IDE. As the names imply, these work with Visual C++ version 6.0. Service Pack 3 is highly recommended. Makefile.vc6 supports the same targets: all, debug, clean and install are all available.

nmake /f Makefile.vc6 all

GNU AutoConf/AutoMake

The input files to drive the GNU AutoConf tool set have been added. See build/gnuauto/readme.txt for instructions on how to use GNU build tools with Tidy.

Example

Perhaps the easiest way to understand how to call Tidy is to see a simple program that uses it. A basic thing to know about the API is that functions that return an integer use the following values:

0 == Success

Good to go.

1 == Warnings, No Errors

Check error buffer or track error messages for details.

2 == Errors and Warnings

By default, Tidy will not produce output when there are errors. You can force output with the TidyForceOutput option. As with warnings, check error buffer or track error messages for details.

<0 == Severe error

Usually value equals -errno. See errno.h.
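As a small illustration of this convention, the helper below turns a status value into a one-line description. It is only a sketch: formatStatus is a hypothetical name, not part of the library, and the call to strerror() relies on the "usually equals -errno" rule above, so the text printed for a severe error may not always be meaningful.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: describe a status value in the caller's buffer.
   Returns the buffer so the call can be used directly in an expression. */
char* formatStatus(int rc, char* out, size_t size)
{
    assert(out != NULL && size > 0);

    if (rc < 0)         /* assumes rc == -errno, which is usual but not guaranteed */
        snprintf(out, size, "severe error: %s", strerror(-rc));
    else if (rc == 0)
        snprintf(out, size, "success");
    else if (rc == 1)
        snprintf(out, size, "warnings, no errors");
    else                /* 2 or above */
        snprintf(out, size, "errors and warnings");

    return out;
}
```

A caller might log the result after each library call, for example with a 128-byte buffer on the stack.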

Also, by default, warning and error messages are sent to stderr. You can redirect diagnostic output using either tidySetErrorFile() or tidySetErrorBuffer(). See tidy.h for details.

#include <tidy.h>
#include <buffio.h>
#include <stdio.h>
#include <errno.h>


int main(int argc, char **argv )
{
  const char* input = "<title>Foo</title><p>Foo!";
  TidyBuffer output = {0};
  TidyBuffer errbuf = {0};
  int rc = -1;
  Bool ok;

  TidyDoc tdoc = tidyCreate();                     // Initialize "document"
  printf( "Tidying:\t%s\n", input );

  ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes );  // Convert to XHTML
  if ( ok )
    rc = tidySetErrorBuffer( tdoc, &errbuf );      // Capture diagnostics
  if ( rc >= 0 )
    rc = tidyParseString( tdoc, input );           // Parse the input
  if ( rc >= 0 )
    rc = tidyCleanAndRepair( tdoc );               // Tidy it up!
  if ( rc >= 0 )
    rc = tidyRunDiagnostics( tdoc );               // Kvetch
  if ( rc > 1 )                                    // If error, force output.
    rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
  if ( rc >= 0 )
    rc = tidySaveBuffer( tdoc, &output );          // Pretty Print

  if ( rc >= 0 )
  {
    if ( rc > 0 )
      printf( "\nDiagnostics:\n\n%s", errbuf.bp );
    printf( "\nAnd here is the result:\n\n%s", output.bp );
  }
  else
    printf( "A severe error (%d) occurred.\n", rc );

  tidyBufFree( &output );
  tidyBufFree( &errbuf );
  tidyRelease( tdoc );
  return rc;
}

Look Ma, no temp files!

Application Notes

Of course, there are functions to parse and save both markup and configuration files. For the adventurous, it is possible to create new input sources and output sinks. For example, a URL source could pull the markup from a given URL.

It is also worth remembering that an application may instantiate any number of document and buffer objects. They are fairly cheap to initialize and destroy (just memory allocation and zeroing, really), so they may be created and destroyed locally, as needed. There is no problem keeping them around a while for keeping state. For example, a server app might keep a global document as a master configuration. As documents are parsed, they can copy their configuration data from the master instance. See tidyOptCopyConfig(). If the master copy is initialized at startup, no synchronization is necessary.

API Docs

A first draft of the API docs has been added to the Tidy header files and generated using Doxygen.

Nightly Build

The build procedures on the Source Forge Compile Farm have been updated to produce the command line driver based on the library sources. See Tidy Binaries.

Future Directions

The ink isn't dry yet on TidyLib and already folks want more! Well, waddaya expect? Several ideas have been discussed on the dev mailing list.

Character Encoding

Currently, all character encoding support is hard wired into the library. This means we do a poor job of supporting many popular encodings such as GB2312, euc-kr, eastern European languages, Cyrillic, etc. Text in any of these encodings must first be transcoded into ISO-10646/Unicode before Tidy can work with it.

Two basic approaches have been proposed: just use iconv or adapt Clark Cooper's XML::Encoding as a callable library. On the face of it, iconv is preferable. Because it is GPL'ed, however, the license may be incompatible. Also, there are transcription issues related to Big5 and other code sets that may or may not be addressed by iconv. XML::Encoding, on the other hand, uses the Perl Artistic License and explicitly supports all alternate transcriptions for Big5 and others. For more info, see CPAN and Tidy Issues.

Error Handling

  • Categorize errors
  • Improve message localization
  • Improve separation of parsing and diagnostics

Content Model

  • Per-element-and-version attribute support
  • DTD Internal Subset support
  • Modular XHTML support (XHTML 1.1)

Page last updated on 26 November, 2002 by C. Reitzel