Summary
How to avoid the nightmarish problems unwittingly caused by dumping text from external documents such as MS Word into Web pages.
One of the biggest mistakes people make when building Web sites is thinking that it's okay to copy text from any location (MS Word documents are the worst) and just paste it into their Web pages. It's not - and it can spoil a perfectly good Web site. So why not?
Because... when you copy text from another document to use in your own, you are often unwittingly copying much more than just the text. Along with the text comes a varying amount of associated junk that was used to control how the text looked in the old document. By way of example, if you open an HTML editor like Dreamweaver and type a sentence of text on a blank Web page, it will produce the following quantity of HTML:
Example 1:
Typing "The quick brown fox" on a blank page in Dreamweaver or FrontPage produces...
<p>The quick brown fox</p>
...in the HTML code.
Copying the same sentence out of a typical MS Word document and pasting it in might produce...
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:Arial">The quick brown fox</span></p>
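For pages that already contain this kind of pasted markup, the difference between the two versions can be removed in code. What follows is only a minimal Python sketch, not a definitive tool: the names WordMarkupStripper and strip_word_markup are made up for illustration, and it assumes that throwing away every span and font tag and every class and style attribute is acceptable for your page. Comments and doctype declarations are also dropped.

```python
from html import escape
from html.parser import HTMLParser


class WordMarkupStripper(HTMLParser):
    """Rebuild HTML while dropping presentational wrappers and attributes.

    Hypothetical helper for illustration; the tag and attribute sets below
    are assumptions, not an exhaustive list of what Word can produce.
    """

    DROP_TAGS = {"span", "font"}     # wrappers removed entirely (their text is kept)
    DROP_ATTRS = {"class", "style"}  # attributes that carry the old formatting

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _render(self, tag, attrs, end=">"):
        kept = []
        for name, value in attrs:
            if name in self.DROP_ATTRS:
                continue
            if value is None:  # attribute without a value, e.g. "checked"
                kept.append(f" {name}")
            else:
                kept.append(f' {name}="{escape(value, quote=True)}"')
        return "<" + tag + "".join(kept) + end

    def handle_starttag(self, tag, attrs):
        if tag not in self.DROP_TAGS:
            self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>.
        if tag not in self.DROP_TAGS:
            self.parts.append(self._render(tag, attrs, end=" />"))

    def handle_endtag(self, tag):
        if tag not in self.DROP_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_word_markup(html):
    """Return html with Word-style wrappers and attributes removed."""
    parser = WordMarkupStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


pasted = ('<p class="MsoNormal"><span style="font-size:10.0pt;'
          'font-family:Arial">The quick brown fox</span></p>')
print(strip_word_markup(pasted))  # prints: <p>The quick brown fox</p>
```

The sketch keeps the structural tags (paragraphs, links, line breaks) and only removes the styling that came along from the old document.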
Example 2:
Even worse than dumping text from Word is when someone actually creates a document in Word and then chooses the 'Save as Web Page' option. Suppose we extend the example from above: we type 'The quick brown fox jumped over the lazy dog' into Word and also add a table for extra effect. When we then choose 'Save as Web Page', the HTML created is so horrendous it will not comfortably fit on this page! Have a look at the HTML in the PDF below and you will see just how bad this problem can be:
Example of bad HTML (PDF, 12KB)
By comparison the same sentence and table in Dreamweaver produces...
If you have looked at the PDF above you will have seen the additional information surrounding the sentence - and this problem gets worse the more heavily formatted the text is in the document. You can end up with huge blocks of code junk around simple sentences. That means more complex HTML that can fail accessibility validation tests, and larger file sizes for your Web pages, which in turn means slower load times.
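To put a rough number on the file size point, the two snippets from Example 1 can be compared directly. This is a small Python sketch under simple assumptions: the function name markup_overhead is made up for illustration, and size is counted as UTF-8 bytes of the snippet alone, ignoring the rest of the page.

```python
# The two versions of the same sentence from Example 1.
clean = "<p>The quick brown fox</p>"
pasted = ('<p class="MsoNormal"><span style="font-size:10.0pt;'
          'font-family:Arial">The quick brown fox</span></p>')
text = "The quick brown fox"


def markup_overhead(html, visible_text):
    """Return the bytes of markup: total UTF-8 size minus the visible text."""
    return len(html.encode("utf-8")) - len(visible_text.encode("utf-8"))


for label, html in (("typed", clean), ("pasted from Word", pasted)):
    total = len(html.encode("utf-8"))
    print(f"{label}: {total} bytes, {markup_overhead(html, text)} of them markup")
```

Even for one short sentence the pasted version is several times larger, and the gap grows with every extra font, colour or table setting in the source document.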
You can avoid all this bad junk by doing one simple thing with any text you want to put on your site - 'clean' it first. You clean it by using a simple text editor such as Notepad, which comes as standard with Windows (Start > All Programs > Accessories > Notepad). All you do is copy your text from the original document, paste it into Notepad, select the text again in Notepad, copy it and then paste it into your Web page.
Doing this will strip away all the 'rubbish' and leave you with the raw unstyled text, but the text will still have some useful formatting like paragraphs, line breaks and punctuation.
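If the junk has already ended up in a page, roughly the same result as the Notepad round trip can be approximated in code. The following is a hedged Python sketch rather than a finished tool: PlainTextExtractor and to_plain_text are hypothetical names, and the set of tags treated as paragraph boundaries is an assumption you may need to adjust.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect only the visible text, keeping paragraph and line breaks.

    Hypothetical helper for illustration; BLOCK_TAGS and SKIP_TAGS are
    assumptions, not a complete list.
    """

    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"style", "script"}  # their contents are never visible text

    def __init__(self):
        super().__init__()  # entities such as &amp; are decoded by default
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.chunks.append("\n")  # a line break survives as a newline

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self.chunks.append("\n")  # end of a paragraph-like block

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def to_plain_text(html):
    """Return the text of html, one non-empty line per paragraph or break."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)


pasted = ('<p class="MsoNormal"><span style="font-family:Arial">'
          'The quick brown fox</span></p>'
          '<p class="MsoNormal">jumped over<br>the lazy dog</p>')
print(to_plain_text(pasted))
```

As with Notepad, all styling is lost but the paragraphs, line breaks and punctuation remain, so the text can be pasted into a Web page and formatted there.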