This is the mail archive of the xsl-list@mulberrytech.com mailing list .



Re: •


Greg,

Hopefully you'll get an answer from a real character-set junkie so you won't have to rely on me. But as it's late on Friday....

At 06:48 PM 6/7/2002, you wrote:

> Why doesn't this XML content:      •
> produce this output:               •
>
> after parsing/XSLT in my XHTML document???
Because HTML has a close enough family resemblance to XML that the presumption is this: wherever the string '& a m p ;' appears in your input (spaces put there to sanitize for obnoxious mailers), you want to *see* an & character displayed in your HTML browser. That requires that it be *represented* as '& a m p ;' in your (conformant) HTML source code (i.e. the serialized output of your transform).

As I'm sure you saw in the FAQ, the XSLT processor, after a file is parsed, "sees" an & character (character no. 38) where there was an escaped *reference* '& a m p ;' in your source. This escaping facility is what allows XML to use that same character as the opening markup delimiter for, of course, entity references.
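
To make that concrete, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree behaves like any conformant XML parser on this point; it is not Xalan, and the sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: the source text holds the five characters "&amp;".
source = "<p>AT&amp;T</p>"

# Parsing resolves the reference. What any later stage (an XSLT processor,
# say) gets to "see" is a single & character, code point 38.
elem = ET.fromstring(source)
parsed_text = elem.text

print(parsed_text)          # the text as the application sees it
print(ord(parsed_text[2]))  # code point of the third character
```

Nothing downstream of the parser can tell whether that & was typed as a reference or arrived some other way; only the character survives.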

Notice for these purposes there's no difference between "•" and "& so's your mama" -- "•" means "show me '•'" (the literal, not any character by that name), so it must perforce be represented as "•" since that's the way to tell an HTML *application* (browser) to do that.


> It's bloody nigh impossible to get my XML parser (Xalan-Java) NOT to recognize entities except for this one case where recognizing it would solve all my problems.
Nope, it's recognizing this one too; it's just properly turning it *back* into an entity reference when you are serializing the file.
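
A rough sketch of that round trip, again using Python's standard xml.etree.ElementTree as a stand-in for the parser and serializer (an assumption; Xalan's serializer is a different piece of code, though it follows the same escaping rule). The number 8226 is only an example, chosen because it is the decimal code point of the bullet character.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: an *escaped* ampersand followed by "#8226;".
source = "<p>&amp;#8226;</p>"

elem = ET.fromstring(source)

# After parsing, the text is seven literal characters: & # 8 2 2 6 ;
# The &amp; was recognized and resolved to a plain & character.
in_memory = elem.text

# On the way out, the serializer must escape that & again, otherwise the
# output would mean something different from the tree it represents.
serialized = ET.tostring(elem, encoding="unicode")

print(in_memory)
print(serialized)
```

So the reference is resolved on input and re-created on output, which is why the result looks as if nothing had been recognized at all.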


> The XSL list FAQ under "Entities", item 13, "Passing Entities through a Transform", says that all entities are resolved before the transform and implies the only way to get around this is with a Perl script to strip entities of their ampersands. This cannot be the whole truth because:
> a) Xalan won't resolve &#amp; in the above example
The whole idea of changing the & into &#amp; is to stop it from being a reference (no, you're right, Xalan won't resolve it), thereby allowing the string to pass through unchanged, so it can be twiddled back into the entity reference afterwards. If it had been a reference going in, it would have disappeared, leaving behind ... the character it had referred to. (Lots of the time this is actually fine.)


> and b) everyone trying to produce html for posting would be screwed by having XML docs with proper Unicode references--nobody could set stuff up so cruelly (right?)
Well, actually they had no choice, it was either be cruel to be kind, or magically uninstall all the browsers ever deployed in the bad old days of HTML, when browsers cared less about "standards" than about conquering the universe. (Come to think of it, that would have been nice, I wonder why they didn't.)


> c) In XSLT Quickly, there's an example of how to define entities in the XSL stylesheet using <xsl:text> to avoid this (p.90-91)--only you can't use this technique on a numbered entity because evidently that's not valid XML so they don't exist, even though they're all over the place.
Who says it's not valid XML? You can use a numeric character reference (a "numbered entity") to refer to any character allowed in XML.
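
For instance (a sketch with Python's standard xml.etree.ElementTree, assumed here to be representative of XML parsers in general; the bullet character U+2022 is just a convenient example), decimal and hexadecimal references to the same character both parse to that character. A serializer writing an encoding that cannot hold the character will emit a numeric reference again by itself.

```python
import xml.etree.ElementTree as ET

# Decimal 8226 and hexadecimal 2022 name the same character, the bullet.
elem = ET.fromstring("<p>&#8226; &#x2022;</p>")

# Both references are resolved to the real character by the parser.
text = elem.text

# ElementTree's tostring() defaults to US-ASCII output, which cannot
# represent U+2022 directly, so it writes a numeric character reference.
ascii_bytes = ET.tostring(elem)

print(text == "\u2022 \u2022")
print(ascii_bytes)
```

In other words, asking for an ASCII-encoded result is often enough to get numbered references in the output, with no escaping tricks in the input.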


> I know this is an old subject; but after hours of investigating, I still don't get it. I need to know why the above example doesn't produce the right numbered entity reference, and what other ways there are to preserve entities through a transform
You can't. An entity reference cannot be preserved, period. The whole idea is that a parser will resolve the reference, turning it into the thing you said it was supposed to be.

That's why the canonical solutions -- such as the Perl pre- and post-processing massages -- are all *workarounds*, not solutions. They basically work by *disguising* the reference as some-funky-string-not-a-reference. It's like the parser is the bouncer at the concert and the entity reference is a beer. The Perl is putting your beer in a paper bag; then when you get to your seat you take it out again.
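
The paper bag might look something like this. It is only a sketch, in Python rather than Perl; the placeholder string and the helper names into_bag and out_of_bag are hypothetical, and a plain parse-and-serialize with xml.etree.ElementTree stands in for the real transform.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; it must never occur in the real content.
BAG = "__AMP__"

def into_bag(xml_text: str) -> str:
    """Pre-process: disguise numeric references so the parser ignores them."""
    return xml_text.replace("&#", BAG + "#")

def out_of_bag(xml_text: str) -> str:
    """Post-process: turn the disguised strings back into references."""
    return xml_text.replace(BAG + "#", "&#")

source = "<p>&#8226; item</p>"

disguised = into_bag(source)     # no reference left for the parser to resolve
elem = ET.fromstring(disguised)  # stand-in for parsing plus transforming
serialized = ET.tostring(elem, encoding="unicode")
restored = out_of_bag(serialized)

print(disguised)
print(restored)
```

Note that the parser never sees a reference at all, which is exactly the point, and exactly why this is a workaround rather than a way of "preserving" anything.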


> , and possibly how Unicode/numbered entities are defined and can be redefined. There just has to be a way to do this within XSLT. I'm sorry that I still don't get this--please help anyway, somebody.
*If* you are writing your output to a file -- and always will be -- you can use a feature supported in some XSL processors that starts with a 'd' and has three words, two of which are "output" and "escaping" (I forget the third). But this is *not as honest* a solution as the paper-bag workaround. At least with the paper bag you are aware of what you are doing.

Now if browsers weren't broken to begin with, none of this would have been a problem. (HotJava anyone?)

I hope that helps,
Wendell


======================================================================
Wendell Piez mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


