This has been bugging me for a while: I write a fair bit of custom HTML and XML parsing code, and I kept wondering which characters are valid in an attribute name in an HTML tag, e.g.
<a href="..." name="...">thing</a>
So, what are the valid characters in HTML (or XML) for “href” and “name”, the attribute names in an HTML tag?
I finally found this, here:
In short, going by the Name rule in the XML 1.0 specification (so strictly speaking this covers XML and XHTML), an attribute name is built like this:
- The first character is a letter, an underscore “_”, or a colon “:” (oddly!)
- Any additional (optional) characters can be a letter, a digit, an underscore, a colon, a period, a hyphen, or a “CombiningChar” or “Extender” character, which I believe is what allows Unicode attribute names.
So, the following are all valid attribute names:
: _ _0:funky :.:valid-_-tag-really:.: _.:._
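As a quick sanity check, here is a minimal Python sketch of the two rules above. It assumes ASCII letters and digits only, so it ignores the wider Unicode letter classes as well as CombiningChar and Extender, and the names `FIRST_CHARS`, `OTHER_CHARS` and `is_valid_attr_name` are made up purely for illustration.

```python
import string

# ASCII-only approximation: the spec's letter and digit classes are wider.
FIRST_CHARS = set(string.ascii_letters + "_:")
OTHER_CHARS = FIRST_CHARS | set(string.digits + ".-")


def is_valid_attr_name(name):
    """Return True if name follows the two rules listed above (ASCII only)."""
    # Rule 1: must be non-empty and start with a letter, underscore, or colon.
    if not name or name[0] not in FIRST_CHARS:
        return False
    # Rule 2: every later character is a letter, digit, "_", ":", "." or "-".
    return all(ch in OTHER_CHARS for ch in name[1:])


for candidate in [":", "_", "_0:funky", ":.:valid-_-tag-really:.:", "_.:._"]:
    print(repr(candidate), is_valid_attr_name(candidate))
```

Every example above passes this check, while something like `0funky` or `-x` fails on the first-character rule.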
Note that the W3C suggests using the colon only for namespaces, so you should use it sparingly.
The regular expression for matching an attribute name, using plain ASCII letters and digits, is therefore as follows:
[A-Za-z_:][-A-Za-z0-9_:.]*
This, obviously, leaves the CombiningChar and Extender characters out of the mix.
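To show how a pattern like this might be used in parsing code, here is a rough Python sketch. The ASCII-only character classes are my reading of the rules above, and `ATTR_NAME`, `ATTR_IN_TAG` and `attribute_names` are hypothetical names, not part of any library. It also assumes every attribute is written as name=value with whitespace in front of the name.

```python
import re

# ASCII-only pattern; CombiningChar and Extender are deliberately left out.
ATTR_NAME = r"[A-Za-z_:][-A-Za-z0-9_:.]*"

# Hypothetical helper: a name preceded by whitespace and followed by "=".
ATTR_IN_TAG = re.compile(r"\s(" + ATTR_NAME + r")\s*=")


def attribute_names(tag):
    """Return the attribute names found in a single opening tag."""
    return ATTR_IN_TAG.findall(tag)


print(attribute_names('<a href="..." name="...">'))
# prints ['href', 'name']
```

This is deliberately naive: it does not skip over quoted values, so a value such as "a b=c" would produce a false match. A real parser would need to consume each value before looking for the next name.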
And a final note: when reading the W3C’s specifications, does anyone else have difficulty finding what they’re looking for?
I swear, when I’m looking for something simple, the mountains of documents I have to wade through make it impossible to find anything with any accuracy.