Valid characters in attribute names in HTML/XML

This has been bugging me for a while. I write a fair bit of custom HTML and XML parsing code, and I kept wondering which characters are valid in an attribute name in an HTML tag, e.g.

<a href="..." name="...">thing</a>

So, what are the valid characters in HTML (or XML) for “href” and “name”, the attribute names in an HTML tag?

I finally found the answer in the “Name” production of the XML specification, here:

http://www.w3.org/TR/2000/REC-xml-20001006#NT-Name

In short, an attribute name (a “Name” in the spec) is built like this:

  • The first character is a letter, the underscore “_”, or the colon “:” (oddly!)
  • Any additional (optional) characters can be: a letter, a digit, underscore, colon, period, hyphen, or a “CombiningChar” or “Extender” character. The spec's “Letter” is not limited to ASCII, so together these allow Unicode attribute names.

So, the following are all valid attribute names:

:
_
_0:funky
:.:valid-_-tag-really:.:
_.:._

Note that the W3C recommends using the colon only for namespaces, so avoid it in names unless that is what you mean.

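The regular expression for matching an attribute name is therefore as follows (note the trailing “*”, which lets the second character class repeat zero or more times, so that one-character names like “_” and long names like “href” both match):

[a-zA-Z_:][-a-zA-Z0-9_:.]*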

This, obviously, leaves non-ASCII letters and the CombiningChar and Extender characters out of the mix.
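
For what it's worth, here is a minimal sketch in Python of the ASCII-only rules described above. The names ATTR_NAME and is_valid_attr_name are my own, picked for illustration, and the pattern carries a trailing “*” so that names of any length match. It uses re.fullmatch so the whole string has to be a name, not just a prefix of it; like the regex above, it ignores non-ASCII letters, CombiningChar and Extender.

```python
import re

# ASCII-only approximation of the XML "Name" rules described in this post.
# Assumption: non-ASCII letters, CombiningChar and Extender characters are
# deliberately left out, so some names the spec allows are rejected here.
ATTR_NAME = re.compile(r"[a-zA-Z_:][-a-zA-Z0-9_:.]*")


def is_valid_attr_name(name: str) -> bool:
    """Return True if the entire string is an ASCII attribute name."""
    # fullmatch anchors at both ends, so "3Desc" or "a b" cannot slip
    # through by matching only part of the string.
    return ATTR_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    candidates = [":", "_", "_0:funky", ":.:valid-_-tag-really:.:",
                  "_.:._", "href", "3Desc", "-x", ""]
    for candidate in candidates:
        print(repr(candidate), is_valid_attr_name(candidate))
```

If you are scanning attribute names out of a larger chunk of markup rather than validating one string, you would use the same pattern with your own delimiters around it instead of fullmatch.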

And, a final note: When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for?

I swear, when I’m looking for something simple, the mountains of documents I have to wade through make it nearly impossible to find anything.

My $0.02.


5 Comments »

  1. Doug Whitney Said,

    May 6, 2009 @ 4:50 pm

    This is exactly what I was looking for, plus the regex to boot!

    Thanks for sharing.

  2. JP Said,

    June 18, 2009 @ 8:09 am

    This has changed to a significantly larger character set, I think (from http://www.w3.org/TR/REC-xml/#NT-NameChar):
    [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
    [4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

  3. Steve Heffernan Said,

    June 19, 2009 @ 6:12 pm

    Thanks for finding that!

  4. Peter Said,

    September 28, 2009 @ 8:39 am

    ^[a-zA-Z_:][-a-zA-Z0-9_:.] else 3Desc is not false!!!!

  5. uchikoma Said,

    March 29, 2010 @ 6:36 am

    “When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for?……..”

    ANSWER

    W3C specs are obviously designed to make people go insane. I, for example, am a little egg….
