[LON-CAPA-dev] Problems with images in HTML files created with MS-Word

Jeremy Bowers lon-capa-dev@mail.lon-capa.org
Tue, 28 Sep 2004 02:29:20 -0500

Ricardo Luis Kulzer wrote:
> *<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
> <HTML xmlns="http://www.w3.org/TR/REC-html40" xmlns:v =
> "urn:schemas-microsoft-com:vml" xmlns:o =
> "urn:schemas-microsoft-com:office:office" xmlns:w =
> "urn:schemas-microsoft-com:office:word"><HEAD><TITLE>1</TITLE>
> <META http-equiv=Content-Type content="text/html; charset=windows-1252">
> <META content=Word.Document name=ProgId>
> <META content="MSHTML 6.00.2800.1458" name=GENERATOR>
> ...*

This is probably not helpful directly to you, Ricardo, but this may 
interest other developers on the list.

I would like to observe that this is *flagrantly* illegal HTML, stunning 
even by Microsoft standards (and I've been dealing with Microsoft "HTML" 
since the first tentative steps with Office and Front Page were taken, 
so I've about seen it all). "xmlns" attributes are only defined for XML 
(and by extension XHTML, imported from the XML standard), yet later tags 
like META are opened but not closed, and attribute values are not 
quoted. In general it does not even remotely resemble an XML file; any 
compliant XML parser would gag on the first META tag. (Assuming the 
spaces in the attributes are just email bugaboos.) Oh, and the fake 
XHTML has the wrong case (all XHTML tag names are lowercase and 
case-sensitive).
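
To make the "any XML parser would gag" point concrete, here is a minimal 
sketch using Python's standard xml.etree.ElementTree. The WORD_HEAD string 
is a shortened stand-in for the quoted header (DOCTYPE dropped, inter-token 
spaces removed), and xml_error is a hypothetical helper name, not anything 
from LON-CAPA:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the Word-generated header quoted above
# (an assumption for illustration, not the complete original file).
WORD_HEAD = (
    '<HTML xmlns="http://www.w3.org/TR/REC-html40" '
    'xmlns:o="urn:schemas-microsoft-com:office:office">'
    '<HEAD><TITLE>1</TITLE>'
    '<META content=Word.Document name=ProgId>'
    '</HEAD></HTML>'
)

# The same information written as well-formed XHTML: lowercase names,
# quoted attribute values, and a closed meta element.
CLEAN_HEAD = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>1</title>'
    '<meta name="ProgId" content="Word.Document" />'
    '</head></html>'
)


def xml_error(markup):
    """Return the parser's error message, or None if markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print("Word header:", xml_error(WORD_HEAD))
    print("Clean header:", xml_error(CLEAN_HEAD))
```

The Word header is rejected at the META tag because of the unquoted 
attribute values; the cleaned-up version parses without complaint.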

For maximal compatibility with your students' machines, I'd echo Gerd's 
recommendation to run generated HTML through a well-known fixer for 
Microsoft HTML/XHTML. Microsoft products do not really produce HTML; 
they just produce something vaguely *like* HTML, and you may find there 
are other situations where you will have better luck with cleaned-up 
Microsoft HTML than with the originals.

You can also download the free (open source) standalone Tidy program for 
Windows here: http://www.paehl.de/tidy/
In fact the sample screenshots show it taking care of a 
Microsoft-generated document.
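
For anyone who wants to script the cleanup, the following is a rough 
sketch, assuming a "tidy" executable is on the PATH. The -asxhtml and -o 
switches and the word-2000 option are standard Tidy options; the function 
names and file names are made up for illustration, and this is not how 
LON-CAPA itself does anything:

```python
import subprocess


def tidy_command(src, dst):
    """Build a Tidy command line that converts Word HTML in src to XHTML in dst."""
    return [
        "tidy",
        "-asxhtml",            # emit XHTML instead of HTML
        "--word-2000", "yes",  # strip the extra markup Word adds on export
        "-o", dst,             # write the cleaned document here
        src,
    ]


def run_tidy(src, dst):
    """Run Tidy and return True unless it reports errors.

    Tidy exits with 0 when clean, 1 when it only issued warnings,
    and 2 when there were errors, so anything below 2 counts as success.
    """
    result = subprocess.run(tidy_command(src, dst), capture_output=True, text=True)
    return result.returncode < 2


if __name__ == "__main__":
    # Hypothetical file names; replace with the real exported document.
    ok = run_tidy("word-export.html", "word-export-clean.html")
    print("cleaned" if ok else "tidy reported errors")
```

Warnings are expected with Word output, which is why an exit status of 1 
is treated as success here.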

(Adding something about this to the documentation would probably be a 
good idea :-) )