Valid characters in attribute names in HTML/XML
This has been bugging me for a while. I write a fair bit of custom HTML and XML parsing code, and I kept wondering which characters are valid in an attribute name in an HTML tag, e.g.
<a href="..." name="...">thing</a>
So, what are the valid characters in HTML (or XML) for “href” and “name”, the attribute names in an HTML tag?
I finally found this, here:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Name
In short, going by the XML “Name” rule at that link, an attribute name is built like this:
- First character is a letter, the underscore “_”, or colon “:” (oddly!)
- Additional (optional) characters can be: a letter, a digit, underscore, colon, period, dash, or a “CombiningChar” or “Extender” character, which I believe allows Unicode attribute names.
So, the following are all valid HTML attribute names:
: _ _0:funky :.:valid-_-tag-really:.: _.:._
Note that the W3C suggests only using colon for namespaces, so you should use it sparingly.
The regular expression for matching an attribute name, therefore, is as follows:
[a-zA-Z_:][-a-zA-Z0-9_:.]*
This, obviously, only covers ASCII letters and leaves the CombiningChar and Extender characters out of the mix.
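For what it's worth, here is a minimal Python sketch of that check. It assumes a `*` after the second character class so that names of any length (including a single character) match, and it uses `fullmatch` so the whole string has to be a name rather than just some substring of it. The names `ATTR_NAME` and `is_valid_attr_name` are my own, purely for illustration.

```python
import re

# ASCII-only approximation of the XML Name rule described above.
# The trailing * allows one-character names as well as longer ones;
# CombiningChar and Extender characters are deliberately left out.
ATTR_NAME = re.compile(r"[a-zA-Z_:][-a-zA-Z0-9_:.]*")


def is_valid_attr_name(name: str) -> bool:
    """Return True if the entire string is an ASCII-only attribute name."""
    # fullmatch anchors at both ends, so a string like "3Desc" cannot
    # sneak through by matching only on a later substring.
    return ATTR_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    samples = [":", "_", "_0:funky", ":.:valid-_-tag-really:.:", "_.:._",
               "href", "3Desc", "-x", ""]
    for candidate in samples:
        print(repr(candidate), is_valid_attr_name(candidate))
```

If you would rather search inside a larger string than validate a whole one, you would need your own anchoring or lookarounds instead of `fullmatch`.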
And, a final note: When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for?
I swear, when I’m looking for something simple, the mountains of documents I have to wade through make it impossible to find anything with any accuracy.
My $0.02.


Doug Whitney Said,
May 6, 2009 @ 4:50 pm
This is exactly what I was looking for, plus the regex to boot!
Thanks for sharing.
JP Said,
June 18, 2009 @ 8:09 am
This has changed to a significantly larger character set, I think (from http://www.w3.org/TR/REC-xml/#NT-NameChar):
[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
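For anyone who wants the larger character set in code, here is a rough Python sketch that transcribes the two productions above into regex character classes. The names `NAME_START_CHAR`, `NAME_CHAR`, `XML_NAME` and `is_xml_name` are made up for illustration, and it relies on Python's `re` module interpreting `\u` and `\U` escapes inside string patterns.

```python
import re

# Character-class body for NameStartChar, transcribed range by range
# from production [4] quoted above.
NAME_START_CHAR = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# NameChar adds "-", ".", digits, #xB7 and two more ranges (production [4a]).
NAME_CHAR = NAME_START_CHAR + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

# One NameStartChar followed by zero or more NameChar.
XML_NAME = re.compile(f"[{NAME_START_CHAR}][{NAME_CHAR}]*")


def is_xml_name(name: str) -> bool:
    """Return True if the entire string matches the name rule above."""
    return XML_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["href", "xml:lang", "donn\u00e9es", "3Desc", "-a"]:
        print(repr(candidate), is_xml_name(candidate))
```

Note that this only checks the character rules; it does not know anything about namespaces or about which attributes a given HTML element actually accepts.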
Steve Heffernan Said,
June 19, 2009 @ 6:12 pm
Thanks for finding that!
Peter Said,
September 28, 2009 @ 8:39 am
It should be ^[a-zA-Z_:][-a-zA-Z0-9_:.] (anchored at the start), or else 3Desc is not rejected!!!!
uchikoma Said,
March 29, 2010 @ 6:36 am
“When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for?……..”
ANSWER
W3C specs are obviously designed to make people go insane. I, for example, am a little egg….