Category Archives: Web pages

Office’s garbled HTML

Brian Jones on why Microsoft Office 2000 (and later) produces such godawful HTML:

Our scenario was that people would start saving “docs” as HTML on their intranet sites and browse them with the browser. We viewed the browser as “electronic paper” that we had to “print” to (i.e. perfect fidelity). We had already got a lot of feedback from our Word97 Internet Assistant add-in that any loss of fidelity when saving as a web page was unacceptable and a “bug”. As it turned out, this usage scenario did not become as common as we thought it would and a zillion conspiracy theories formed about why we “really” did it. Many people assumed that a better approach would have been to save as “clean” HTML even if the result did not look exactly like what the user saw on the screen. We felt that the core office applications (other than FrontPage) were not really meant to be web page authoring tools, so we focused on converting docs to exact replicas in HTML. We didn’t want people losing any functionality when saving to HTML so we had to figure out a way to store everything that could have existed in a binary document as HTML. We thought we were clever creating a bunch of “mso-” css properties that allowed us to roundtrip everything. HTML didn’t take off in the same way we had expected, and today, the main use for Office HTML is for interoperability on the clipboard, though of course the biggest use is within e-mail (WordMail).

None of this explains why Office 2003’s “Filtered HTML” is so riddled with proprietary tags, though. Admittedly, a filtered HTML file is smaller than a roundtrip HTML file out of Word, but it’s still hugely bigger than the type of HTML you’d write from scratch (or in a web page editor such as Dreamweaver or FrontPage), and the source code is unreadable.

To my mind, Filtered HTML should be just that: HTML, filtered in such a way that the basic structure of the document is preserved, but none of the junk that Word (or whatever) stores along with it. Leave that for the roundtrip HTML — though I can’t see the appeal in that either, since if you want to store documents in a viewable form on the great InterWeb, PDF is the way to go. Or just store it in the native Office format for internal use, when you know every user will have the application or a viewer.

(By the way, when I was trying out the roundtrip HTML the other day, while reloading, Word presented me with a strange warning that it was going to query from some nonsense “Z” table to put data in the document. Bizarro. The test document did quote some SQL, but this would seem to suggest the roundtrip HTML isn’t all it’s cracked up to be.)

Anyway, Brian’s full article is about the progression of the Office formats from binary in the 90s into the XML to be used in the next version. Well worth a read if you want some background on the history, and where they’re going now.

Backslashes/Web dev toolbars

If you mistakenly put backslashes in your relative hyperlinks, IE silently replaces them with forward slashes. Does IE do this on Macs I wonder? It seems a very DOS-centric way of doing things. This is not “embrace and extend”. This is “be nice to sloppy people, breaking it for everybody else”. Firefox doesn’t like backslashes, correctly replacing them with %5C and then choking.
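Here's a rough Python sketch of the two behaviours, just to make the difference concrete. It only mimics what's described above and is not how either browser is actually implemented; the function names and the sample link are made up for illustration.

```python
from urllib.parse import quote


def ie_style(href: str) -> str:
    """Mimic the forgiving behaviour: quietly turn backslashes into slashes."""
    return href.replace("\\", "/")


def firefox_style(href: str) -> str:
    """Mimic the strict behaviour: a backslash is not a path separator,
    so it gets percent-encoded as %5C and stays part of the file name."""
    return quote(href, safe="/")


# Hypothetical relative link written by someone thinking in DOS paths.
href = "images\\photos\\cat.jpg"

print(ie_style(href))       # the path the author probably meant
print(firefox_style(href))  # one oddly named file, which almost certainly 404s
```

The second result is a request for a single file whose name contains two encoded backslashes, which is why the link chokes.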

Meanwhile, MS has released a developer’s toolbar for IE (beta). I don’t normally use IE, but I had a quick look. WTF — it requires a complete system reboot to take effect. It looks like it has some handy features, but boy, it’s a bit buggy… try and view table outlines, and it takes ages if there’s more than a handful. Not so good.

Frankly, the Firefox web developer extension craps all over it.

Sneaky popups at Fairfax

The Age and SMH web sites have seen the writing on the wall for popup adverts, with browser popup blockers now blocking most ads that don’t occur as a result of direct user action.

So you know what they’ve done? Triggered a popup if you happen to click on part of an article window which normally wouldn’t be considered clickable, such as on a non-hyperlinked word. It’s a user action, so the popup gets around the blocker. It seems to fire only occasionally though, so you don’t notice what set it off. Sneaky.

Cleaning up HTML out of Office

I found a good guide to cleaning out the gunk that’s in Word’s HTML documents. For the smallest, most efficient files it seems to conclude that the Textism Wordcleaner is the way to go: free for files under 20Kb; for bigger files subscription options are available. This issue has been causing me some angst for some time, and one of these days I’m going to bash out a tool for this myself. (Don’t hold your breath.)

The bandwidth hogs at allresearch.com

It seems that, like some others, my sites are being bombarded with hits from a mob called AllResearch. Apparently one of the things they do is hit RSS feeds and suck down every page referenced, for some kind of indexing. Judging from the amount of traffic they’re burning up, they suck big-time, in fact. I mean, indexers usually put in a lot of hits on web sites, but these guys are hitting more than 10 times as much as the next one down the list, MSN.

These are the top hitters over at toxiccustard.com:

  • 45541 sp1.allresearch.com
  • 3448 msnbot.msn.com
  • 3110 index.atomz.com
  • 1328 crawl25-public.alexa.com
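A list like that can be pulled out of an access log with a few lines of Python. This is a minimal sketch, assuming a Common Log Format file where the first field on each line is the (already resolved) client host; the function name and the sample lines, timestamps included, are invented for illustration.

```python
from collections import Counter


def top_hitters(log_lines, n=4):
    """Count requests per client host and return the n busiest."""
    hosts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            hosts[fields[0]] += 1  # first field is the client host
    return hosts.most_common(n)


# Made-up sample lines in Common Log Format.
sample = [
    'sp1.allresearch.com - - [01/Jan/2005:00:00:01 +1100] "GET /feed/ HTTP/1.1" 200 512',
    'sp1.allresearch.com - - [01/Jan/2005:00:00:02 +1100] "GET /a.html HTTP/1.1" 200 900',
    'msnbot.msn.com - - [01/Jan/2005:00:00:03 +1100] "GET /robots.txt HTTP/1.1" 200 64',
]

for host, hits in top_hitters(sample):
    print(hits, host)
```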

Time for a little .htaccess magic:

order allow,deny
deny from 38.144.36.
allow from all
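For anyone wondering what those three lines actually do: with "order allow,deny", Apache checks the allow rules first and then the deny rules, and a matching deny wins. The trailing-dot address is a partial IP, which covers every address starting with those three octets. Here's a small Python model of that logic, as a sketch only; the function name and test addresses are made up, and it ignores everything else Apache might take into account.

```python
import ipaddress

# "deny from 38.144.36." matches the whole 38.144.36.x block.
DENIED = ipaddress.ip_network("38.144.36.0/24")


def is_allowed(client_ip: str) -> bool:
    """Model 'order allow,deny': must match an allow rule and no deny rule."""
    addr = ipaddress.ip_address(client_ip)
    matches_allow = True            # "allow from all"
    matches_deny = addr in DENIED   # "deny from 38.144.36."
    return matches_allow and not matches_deny


print(is_allowed("38.144.36.7"))   # inside the blocked range
print(is_allowed("38.144.37.7"))   # a neighbouring range, still welcome
```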

Ad blocking begins to have an economic effect

So I was checking out copper (as you do), and followed the Wikipedia copper entry link to EnvironmentalChemistry.com’s copper data, and I discovered that ad blockers are beginning to change the economics of the web. The web site whinged that they had detected ad blocking, and if I wanted to get the content I’d have to turn it off (and provided directions – which I followed, but it just turned out to be a bunch of atomic numbers and covalent bonds and useless crap like that).

The economics of a lot of the web are not dissimilar to those of free-to-air television; there’s a covenant between the producers (broadcasters/webauthors) and the consumers – we will let this stuff out to anyone, and you will consume our advertising. Advertisers give the producers cash to cover the costs of publishing. There’s a profit in it, and everyone’s happy.

Except that consumers have decided they don’t like the deal anymore. People are taping TV shows, and skipping the ads. People are using ad blockers in their browsers. The economics of the model are breaking down. I personally am behaving this way because I find the advertising increasingly intrusive and irrelevant, and thus annoying. The ads suck, for products that suck, and they’re shoved down my throat. So I avoid them. This is how a character in Carl Sagan’s novel Contact became the richest man on earth – by selling TV ad blockers.

The three outcomes I can forecast from this are:

  1. increased relevance of advertising (unlikely, the reason advertising is necessary is because of an inherent suckiness of the products, otherwise they’d be compelling)
  2. decreased expenditure on content provision (on TV, cheaper nastier shows – if that’s possible; on the web, uneconomic sites being pulled or at least not updated)
  3. product placement, which is a bit like 1, ‘cept different because it’s more about appropriate products in appropriate places

I for one have no idea how this will play out, but I’m sure advertising will get more subtle. It’s done that over the last century, and will continue to in response to increasing consumer sophistication. Perhaps advertisers will find a way to back off, and only offer their products to customers who want them; they certainly want to act that way, because it’s a waste of money advertising women’s sanitary napkins to the gay male viewers of Friends — unless they’re planning to fix their car’s leaky roof with one.

BTW, how did they figure out I was blocking their ads?

The joys of .htaccess

For those who merely dabble in Apache, .htaccess seems a little like black magic. Yet it’s so useful… it can do default (index) documents, redirection, password protection, custom 404s, blocking image stealers… everything! This set of pages serves as a useful tutorial for doing it all.

Excel to HTML

I can’t believe how stupid Excel (2002/XP) was with the table of browsers the other day.

The plan was to get the numbers into Excel, copy/paste into a FrontPage table to strip back the formatting, then paste into WordPress.

Nup, bloody monstrous Excel tags right the way through it, which FrontPage couldn’t override, and evidently no easy way to strip. No combination of Paste Special would work. So for example, instead of <td></td> we got:

<td align="right" x:num="1.15E-2" style="color: windowtext; font-size: 10.0pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; text-align: general; vertical-align: bottom; white-space: nowrap; border: medium none; padding-left: 1px; padding-right: 1px; padding-top: 1px"></td>

I kid you not. Now, I know about round-trip HTML, though I have my doubts that anybody uses it: firstly because it looks like crap in a web browser, and secondly because if you want to edit it later, you’ll just keep an XLS copy. Besides, it’s badly implemented. The cell above was using the “Normal” style. It shouldn’t have had all the formatting crap embedded in it.
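For what it's worth, the attribute bloat on a cell like that can be knocked off with a regular expression. This is a crude Python sketch, not a proper HTML cleaner: it assumes no attribute value contains a literal ">", it only touches table, tr, td and th tags, and the function name is made up. The sample cell is a shortened version of the one above.

```python
import re

# Opening tag of a table element, attributes and all.
# Assumption: no attribute value contains a literal ">".
TABLE_TAG = re.compile(r"<(table|tr|td|th)\b[^>]*>", re.IGNORECASE)


def strip_table_attributes(markup: str) -> str:
    """Reduce e.g. <td align=... style=...> to a bare <td>."""
    return TABLE_TAG.sub(r"<\1>", markup)


cell = ('<td align="right" x:num="1.15E-2" '
        'style="color: windowtext; font-size: 10.0pt"></td>')
print(strip_table_attributes(cell))
```

A real cleaner would use an HTML parser rather than a regex, but for a small pasted table this is usually enough.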

Word XP actually has a Save As Filtered HTML option to strip out all this crap. Excel XP doesn’t. (I haven’t checked Excel 2003 yet).

Plan 2 was to save it as HTML, load it into FrontPage and crop the HTML to paste into WordPress. Nup, trying to re-open it in FrontPage just threw it back to Excel. WTF?! Opening in UltraEdit (my preferred text editor) just revealed the same tags as above.

How can two Microsoft products that are part of the same suite, same version, operate so disastrously badly with one another, for something as simple as copying a table?

Plan 3? Oh bugger it, it’s only a few lines, just write it by hand.

If it were more I’d go install and run that clear The Useless Crap Out Of The HTML filter thing (oh look, they could do with clearing the crap out of their URLs too), but it refuses to install unless you have Office 2000. Wonderful.

Next time (after swearing a bit) I’ll probably save to CSV and then do a global replace from commas to table tags.
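Something like the following would do the CSV route without a manual search and replace. It's a minimal Python sketch using the standard csv and html modules; the function name and the sample data are invented, and it makes every cell a plain td with no header row.

```python
import csv
import html
import io


def csv_to_html_table(csv_text: str) -> str:
    """Turn CSV text into a bare HTML table, one <tr> per row."""
    lines = ["<table>"]
    for row in csv.reader(io.StringIO(csv_text)):
        # Escape cell text so stray < or & characters don't break the markup.
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Made-up sample data standing in for the exported spreadsheet.
sample = "Browser,Share\nBrowser A,60%\nBrowser B,40%\n"
print(csv_to_html_table(sample))
```

Using the csv module rather than a blind replace on commas means quoted cells that contain commas survive intact.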

Surely there must be an easier way?

Sitepoint Anomaly

I’ve been meaning to buy a couple of books from SitePoint for a while now. I’ve borrowed a copy of their HTML Utopia: Designing Without Tables Using CSS, a fantastic guide to CSS, and their Build Your Own Database Driven Website Using PHP & MySQL looks great, so when they emailed me an offer of 20% off this book I thought, why not?

That is, until I saw the site. If you spend over USD$70 (effectively two books) you get free postage anywhere in the world. Hmmm. Take the offer and save $7 off one book or reject the offer (which takes me below $70), pay full price and save $15?

Regardless, they’re great books.