Migrating from XML4C 2.x

This document is a discussion of the technical differences between XML4C 2.x code base and the new Xerces-C 1.4.0 code base.

Topics discussed are:

General Improvements

  Compliance
  Bug Fixes
  Speed

Summary of changes required to migrate from XML4C 2.x to Xerces-C 1.4.0
The Samples
Parser Classes
DOM Level 2 support
Progressive Parsing
Namespace support
Moved Classes to src/framework
Loadable Message Text
Pluggable Validators
Pluggable Transcoders
Util directory Reorganization

General Improvements

The new version is improved in many ways. Some general improvements are: significantly better conformance to the XML spec, cleaner internal architecture, many bug fixes, and faster speed.

Compliance

Except for a couple of very obscure cases (mostly related to the 'standalone' mode), this version should be quite compliant. We have more than a thousand tests, some collected from various public sources and some generated by IBM, which are used for regression testing. The C++ parser now passes all but a handful of them.

Bug Fixes

This version has many bug fixes relative to XML4C version 2.x. Some of these were reported by users and some were uncovered by the conformance testing.

Speed

Much work was done to speed up this version. Some of the new features, such as namespaces, and the additional conformance checks ate up some of these gains, but overall the new version is significantly faster than previous versions, even while doing more.

Summary of changes required to migrate from XML4C 2.x to Xerces-C 1.4.0

As mentioned, there are some major architectural changes between the 2.3.x and Xerces-C 1.4.0 releases of the parser, and as a result the code has undergone significant restructuring. The list below mentions the public APIs which existed in 2.3.x and no longer exist in Xerces-C 1.4.0. It also mentions the Xerces-C 1.4.0 API which will give you the same functionality. Note: This list is not exhaustive. The API docs (and ultimately the header files) supplement this information.

parsers/[Non]Validating[DOM/SAX]parser.hpp
These files/classes have all been consolidated in the new version to just two files/classes: [DOM/SAX]Parser.hpp. Validation is now a property which may be set before invoking the parse; the setDoValidation() method controls it.
The framework/XMLDocumentTypeHandler.hpp has been replaced with validators/DTD/DocTypeHandler.hpp.
The following methods now have a different set of parameters because the underlying base class methods have changed in the 3.x release. These methods belong to one of the XMLDocumentHandler, XMLErrorReporter or DocTypeHandler interfaces.

[Non]Validating[DOM/SAX]Parser::docComment
[Non]Validating[DOM/SAX]Parser::doctypePI
[Non]ValidatingSAXParser::elementDecl
[Non]ValidatingSAXParser::endAttList
[Non]ValidatingSAXParser::entityDecl
[Non]ValidatingSAXParser::notationDecl
[Non]ValidatingSAXParser::startAttList
[Non]ValidatingSAXParser::TextDecl
[Non]ValidatingSAXParser::docComment
[Non]ValidatingSAXParser::docPI
[Non]Validating[DOM/SAX]Parser::endElement
[Non]Validating[DOM/SAX]Parser::startElement
[Non]Validating[DOM/SAX]Parser::XMLDecl
[Non]Validating[DOM/SAX]Parser::error

The following methods/data members changed visibility from protected in 2.3.x to private (with public setters and getters, as appropriate).

[Non]ValidatingDOMParser::fDocument
[Non]ValidatingDOMParser::fCurrentParent
[Non]ValidatingDOMParser::fCurrentNode
[Non]ValidatingDOMParser::fNodeStack

The following files have moved, possibly requiring changes in the #include statements.

MemBufInputSource.hpp
StdInInputSource.hpp
URLInputSource.hpp

All the DTD validator code was moved from internal to a separate validators/DTD directory.
The error code definitions which were earlier in internal/ErrorCodes.hpp are now split up into the following files:

framework/XMLErrorCodes.hpp - Core XML errors
framework/XMLValidityCodes.hpp - DTD validity errors
util/XMLExceptMsgs.hpp - C++ specific exception codes.

The Samples

The sample programs no longer use any of the unsupported util/xxx classes. The samples used those classes only so that they could be written portably. Since we feel that the wide character APIs are supported on a lot of platforms these days, we decided to go ahead and write the samples in terms of these. If your system does not support these APIs, you will not be able to build and run the samples. On some platforms, these APIs might be optional packages or might require runtime updates or some similar action.

More samples have been added as well. These highlight some of the new functionality introduced in the new code base. And the existing ones have been cleaned up as well.

The new samples are:

PParse - Demonstrates 'progressive parse' (see below)
StdInParse - Demonstrates use of the standard in input source
EnumVal - Shows how to enumerate the markup decls in a DTD Validator

Parser Classes

In the XML4C 2.x code base, there were the following parser classes (in the src/parsers/ source directory): NonValidatingSAXParser, ValidatingSAXParser, NonValidatingDOMParser, ValidatingDOMParser. The non-validating ones were the base classes and the validating ones just derived from them and turned on the validation. This was deemed a little bit overblown, considering the tiny amount of code required to turn on validation and the fact that it makes people use a pointer to the parser in most cases (if they needed to support either validating or non-validating versions.)

The new code base just has SAXParser and DOMParser classes. These are capable of handling both validating and non-validating modes, according to the state of a flag that you can set on them. For instance, here is a code snippet that shows this in action.

void ParseThis(const XMLCh* const fileToParse,
               const bool validate)
{
  //
  // Create a SAXParser. It can now just be
  // created by value on the stack if we want
  // to parse something within this scope.
  //
  SAXParser myParser;

  // Tell it whether to validate or not
  myParser.setDoValidation(validate);

  // Parse and catch exceptions...
  try
  {
    myParser.parse(fileToParse);
  }
  ...
}

We feel that this is a simpler architecture, and that it makes things easier for you. In the above example, for instance, the parser will be cleaned up for you automatically upon exit since you don't have to allocate it anymore.

DOM Level 2 support

Experimental early support for some parts of the DOM level 2 specification has been added. These address some of the shortcomings in our DOM implementation, such as the lack of a simple, standard mechanism for tree traversal.

Progressive Parsing

The new parser classes support, in addition to the parse() method, two new parsing methods, parseFirst() and parseNext(). These are designed to support 'progressive parsing', so that you don't have to depend upon throwing an exception to terminate the parsing operation. Calling parseFirst() will cause the DTD (or in the future, Schema) to be parsed (both internal and external subsets) along with any pre-content, i.e. everything up to but not including the root element. Each subsequent call to parseNext() will cause one more piece of markup to be parsed and spit out from the core scanning code to the parser (and hence either on to you if using SAX or into the DOM tree if using DOM.) You can quit the parse at any time by just not calling parseNext() anymore and breaking out of the loop. When you call parseNext() and the end of the root element is the next piece of markup, the parser will continue on to the end of the file and return false, to let you know that the parse is done. So a typical progressive parse loop will look like this:

// Create a progressive scan token
XMLPScanToken token;

if (!parser.parseFirst(xmlFile, token))
{
  cerr << "parseFirst() failed" << endl;
  return 1;
}

//
// We started ok, so let's call parseNext()
// until we find what we want or hit the end.
//
bool gotMore = true;
while (gotMore && !handler.getDone())
  gotMore = parser.parseNext(token);

In this case, our event handler object (named 'handler' surprisingly enough) is watching for some criteria and will return a status from its getDone() method. Since the handler sees the SAX events coming out of the SAXParser, it can tell when it finds what it wants. So we loop until we get no more data or our handler indicates that it saw what it wanted to see.

When doing non-progressive parses, the parser can easily know when the parse is complete and ensure that any used resources are cleaned up. Even in the case of a fatal parsing error, it can clean up all per-parse resources. However, when progressive parsing is done, the client code doing the parse loop might choose to stop the parse before the end of the primary file is reached. In such cases, the parser will not know that the parse has ended, so any resources will not be reclaimed until the parser is destroyed or another parse is started.

This might not seem like such a bad thing; however, in this case, the files and sockets which were opened in order to parse the referenced XML entities will remain open. This could cause serious problems. Therefore, you should destroy the parser instance in such cases, or restart another parse immediately. In a future release, a reset method will be provided to do this more cleanly.

Also note that you must create a scan token and pass it back in on each call. This ensures that things don't get done out of sequence. When you call parseFirst() or parse(), any previous scan tokens are invalidated and will cause an error if used again. This prevents incorrect mixed use of the two different parsing schemes or incorrect calls to parseNext().
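The token rule can be pictured with a small stand-alone sketch. It does not use the Xerces-C classes: SketchParser and SketchToken are hypothetical names, the "markup" is just a list of strings, and the real parser reports a stale token through its own error mechanism rather than the exception assumed here. The only idea shown is that each parseFirst() call issues a fresh sequence id, so a token from an earlier parse no longer matches.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a scan token: it only remembers which
// parse sequence it was issued for.
struct SketchToken
{
    unsigned int sequenceId = 0;
};

// Hypothetical stand-in for a parser that supports progressive parsing.
class SketchParser
{
public:
    explicit SketchParser(const std::vector<std::string>& pieces)
        : fPieces(pieces)
    {
    }

    // Start a progressive parse. Stamping the token with a new id
    // invalidates every token issued by an earlier call.
    bool parseFirst(SketchToken& token)
    {
        fNext = 0;
        fSeen.clear();
        token.sequenceId = ++fCurrentId;
        return true;
    }

    // Deliver one more piece. Returns false once the last piece has
    // been delivered. A stale token is rejected (assumption: by throwing).
    bool parseNext(const SketchToken& token)
    {
        if (token.sequenceId != fCurrentId)
            throw std::logic_error("stale scan token");
        if (fNext >= fPieces.size())
            return false;
        fSeen.push_back(fPieces[fNext++]);
        return fNext < fPieces.size();
    }

    // The pieces delivered so far in the current parse.
    const std::vector<std::string>& seen() const { return fSeen; }

private:
    std::vector<std::string> fPieces;
    std::vector<std::string> fSeen;
    std::size_t fNext = 0;
    unsigned int fCurrentId = 0;
};
```

In this sketch, a second call to parseFirst() with a new token makes any later parseNext() call with the first token fail, while the new token keeps working.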

Namespace support

The C++ parser now supports namespaces. With current XML interfaces (SAX/DOM) this doesn't mean very much because these APIs are incapable of passing on the namespace information. However, if you are using our internal APIs to write your own parsers, you can make use of this new information. Since the internal event APIs must be able to now support both namespace and non-namespace information, they have more parameters. These allow namespace information to be passed along.

Most of the samples now have a new command line parameter to turn on namespace support. You turn on namespaces like this:

SAXParser myParser;
// Tell it to do namespace processing
myParser.setDoNamespaces(true);

Moved Classes to src/framework

Some of the classes previously in the src/internal/ directory have been moved to their more correct location in the src/framework/ directory. These are classes used by the outside world and should have been framework classes to begin with. Also, to avoid name clashes in the absence of C++ namespace support, some of these classes have been renamed to make them more XML specific and less likely to clash. More classes might end up being moved to framework as well.

So you might have to change a few include statements to find these classes in their new locations. And you might have to rename some of the names of the classes, if you used any of the ones whose names were changed.

Loadable Message Text

The system now supports loadable message text, instead of having it hard coded into the program. The current drop still just supports English, but it can now support other languages. Anyone interested in contributing any translations should contact us. This would be an extremely useful service.

In order to support the local message loading services, we have created a pretty flexible framework for supporting loadable text. Firstly, there is now an XML file, in the src/NLS/ directory, which contains all of the error messages. There is a simple program, in the Tools/NLSXlat/ directory, which can spit out that text in various formats. It currently supports a simple 'in memory' format (i.e. an array of strings), the Win32 resource format, and the message catalog format. The 'in memory' format is intended for very simple installations or for use when porting to a new platform (since you can use it until you can get your own local message loading support done.)

In the src/util/ directory, there is now an XMLMsgLoader class. This is an abstraction from which any number of message loading services can be derived. Your platform driver file can create whichever type of message loader it wants to use on that platform. We currently have versions for the in memory format, the Win32 resource format, and the message catalog format. An ICU one is present but not implemented yet. Some of the platforms can support multiple message loaders, in which case a #define token is used to control which one is used. You can set this in your build projects to control the message loader type used.
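The shape of this arrangement can be sketched in a few lines of stand-alone C++. This is an illustration only, not the XMLMsgLoader interface itself: SketchMsgLoader, InMemorySketchLoader, makeSketchLoader() and the SKETCH_MSGLOAD_INMEMORY token are all hypothetical names, the method signature is simplified, and the two message strings are placeholders rather than text taken from the message file.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified counterpart of a message loader abstraction.
class SketchMsgLoader
{
public:
    virtual ~SketchMsgLoader() = default;

    // Fill 'toFill' with the text for 'msgId'; return false if unknown.
    virtual bool loadMsg(std::size_t msgId, std::string& toFill) const = 0;
};

// The 'in memory' flavour: the text is just an array of strings.
class InMemorySketchLoader : public SketchMsgLoader
{
public:
    bool loadMsg(std::size_t msgId, std::string& toFill) const override
    {
        // Placeholder messages, invented for this sketch.
        static const std::vector<std::string> messages = {
            "Expected an element name",
            "Unterminated comment"
        };
        if (msgId >= messages.size())
            return false;
        toFill = messages[msgId];
        return true;
    }
};

// A build-time token picks the loader. It is defined here only so the
// sketch compiles on its own; normally the build project would set it.
#define SKETCH_MSGLOAD_INMEMORY

// Stand-in for the per-platform code that creates the loader.
std::unique_ptr<SketchMsgLoader> makeSketchLoader()
{
#if defined(SKETCH_MSGLOAD_INMEMORY)
    return std::unique_ptr<SketchMsgLoader>(new InMemorySketchLoader);
#else
    #error "No message loader selected for this sketch"
#endif
}
```

Code that needs message text would call makeSketchLoader() once and work only through the SketchMsgLoader interface, which is why swapping the loader does not touch the rest of the system.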

Both the Java and C++ parsers emit the same messages for an XML error since they are being taken from the same message file.

Pluggable Validators

In a preliminary move to support Schemas, and to make them first class citizens just like DTDs, the system has been reworked internally to make validators completely pluggable. So now the DTD validator code is under the src/validators/DTD/ directory, with a future Schema validator probably going into the src/validators. The core scanner architecture now works completely in terms of the framework/XMLValidator abstract interface and knows almost nothing about DTDs or Schemas. For now, if you don't pass in a validator to the parsers, they will just create a DTDValidator. This means that, theoretically, you could write your own validator. But we would not encourage this for a while, until the semantics of the XMLValidator interface are completely worked out and proven to handle DTD and Schema cleanly.
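As a rough picture of what "pluggable" means here, the following stand-alone sketch has a scanner that talks only to an abstract validator and falls back to a DTD-style validator when none is supplied. All names (SketchValidator, SketchDTDValidator, SketchScanner) and the two methods are hypothetical; the real framework/XMLValidator interface is considerably larger and is not reproduced here.

```cpp
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical abstract validator: the scanner sees only this interface.
class SketchValidator
{
public:
    virtual ~SketchValidator() = default;
    virtual std::string name() const = 0;
    virtual bool isDeclared(const std::string& element) const = 0;
};

// Hypothetical DTD-style validator that knows a set of declared elements.
class SketchDTDValidator : public SketchValidator
{
public:
    void declare(const std::string& element) { fDeclared.insert(element); }

    std::string name() const override { return "DTD"; }

    bool isDeclared(const std::string& element) const override
    {
        return fDeclared.count(element) != 0;
    }

private:
    std::set<std::string> fDeclared;
};

// Hypothetical scanner. If no validator is plugged in, it creates the
// default DTD-style one, mirroring the behaviour described above.
class SketchScanner
{
public:
    explicit SketchScanner(std::unique_ptr<SketchValidator> validator = nullptr)
        : fValidator(std::move(validator))
    {
        if (!fValidator)
            fValidator.reset(new SketchDTDValidator);
    }

    bool elementIsValid(const std::string& element) const
    {
        return fValidator->isDeclared(element);
    }

    const SketchValidator& validator() const { return *fValidator; }

private:
    std::unique_ptr<SketchValidator> fValidator;
};
```

A default-constructed SketchScanner reports a validator named "DTD", while passing in a different SketchValidator subclass changes the checks without any change to the scanner.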

Pluggable Transcoders

Another abstract framework added in the src/util/ directory is to support pluggable transcoding services. The XMLTransService class is an abstract API that can be derived from, to support any desired transcoding service. XMLTranscoder is the abstract API for a particular instance of a transcoder for a particular encoding. The platform driver file decides what specific type of transcoder to use, which allows each platform to use its native transcoding services, or the ICU service if desired.

Implementations are provided for Win32 native services, ICU services, and the iconv services available on many Unix platforms. The Win32 version only provides native code page services, so it can only handle XML code in the intrinsic encodings: ASCII, UTF-8, UTF-16 (Big/Little Endian), UCS4 (Big/Little Endian), the EBCDIC code pages IBM037 and IBM1140, ISO-8859-1 (aka Latin1) and Windows-1252. The ICU version provides all of the encodings that ICU supports. The iconv version will support the encodings supported by the local system. You can use the transcoders we provide or create your own if you feel ours are insufficient in some way, or if your platform requires an implementation that we do not provide.
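The split between a transcoding service and the per-encoding transcoders it hands out can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the XMLTransService or XMLTranscoder API: every class and method name below is hypothetical, and the only encoding handled is ISO-8859-1, where each byte maps directly to the code point of the same value.

```cpp
#include <memory>
#include <string>

// Hypothetical transcoder for one particular encoding.
class SketchTranscoder
{
public:
    virtual ~SketchTranscoder() = default;

    // Convert encoded bytes into Unicode code points.
    virtual std::u32string transcode(const std::string& bytes) const = 0;
};

// ISO-8859-1: every byte value is the code point itself.
class Latin1SketchTranscoder : public SketchTranscoder
{
public:
    std::u32string transcode(const std::string& bytes) const override
    {
        std::u32string out;
        out.reserve(bytes.size());
        for (char ch : bytes)
            out.push_back(static_cast<char32_t>(static_cast<unsigned char>(ch)));
        return out;
    }
};

// Hypothetical transcoding service: hands out transcoders by encoding name.
class SketchTransService
{
public:
    virtual ~SketchTransService() = default;

    // Returns a null pointer if the encoding is not supported.
    virtual std::unique_ptr<SketchTranscoder>
    makeTranscoderFor(const std::string& encodingName) const = 0;
};

// A minimal service that supports a single encoding.
class MinimalSketchService : public SketchTransService
{
public:
    std::unique_ptr<SketchTranscoder>
    makeTranscoderFor(const std::string& encodingName) const override
    {
        if (encodingName == "ISO-8859-1")
            return std::unique_ptr<SketchTranscoder>(new Latin1SketchTranscoder);
        return nullptr;
    }
};
```

A platform that wanted a richer service would supply another SketchTransService subclass; callers would still only ask the service for a transcoder by encoding name.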

Util directory Reorganization

The src/util directory was becoming somewhat of a dumping ground of platform and compiler stuff. So we reworked that directory to better spread things out. The new scheme is:

util - The platform independent utility stuff
  MsgLoaders - Holds the msg loader implementations
    ICU
    InMemory
    MsgCatalog
    Win32
  Compilers - All the compiler specific files
  Transcoders - Holds the transcoder implementations
    Iconv
    ICU
    Win32
  Platforms
    AIX
    HP-UX
    Linux
    Solaris
    ....
    Win32

This organization makes things much easier to understand. And it makes it easier to find which files you need and which are optional. Note that only per-platform files have any hard coded references to specific message loaders or transcoders. So if you don't include the ICU implementations of these services, you don't need to link in ICU or use any ICU headers. The rest of the system works only in terms of the abstraction APIs.