What’s up with those web addresses that have special characters in them?
As technology spreads across the globe, the internet is steadily globalizing as well. According to the International Telecommunication Union (ITU), more than 3 billion people use the internet today, and increasingly they do so in their own native languages. This change was partly brought on by the introduction of internationalized domain names (IDNs) back in 2003.
What Is An Internationalized Domain Name (IDN)?
Until 2003, domains could consist only of letters from the Latin alphabet, the digits 0-9, and hyphens. These limited options trace back to the domain name system (DNS). That service, which is responsible for translating domain names into IP addresses, operates on a naming scheme based on the American Standard Code for Information Interchange (ASCII). ASCII is a character set built largely around English-language text, which makes it a poor fit for an international project like the internet.
To compensate for this drawback, a new mechanism called Internationalizing Domain Names in Applications (IDNA) was created. Its goal is to define a standardized translation from Unicode into ASCII, making it possible to use characters from many different writing systems within domain names. IDNA is often regarded as one of the most significant changes in the history of the internet, and it is especially helpful for people using Asian, African, or Arabic writing systems. Theoretically, any Unicode text can be used in an internationalized domain name. In practice, however, each domain registry decides for itself which special characters can be registered. This means that the permitted symbols differ depending on which top-level domain (e.g. .com, .mx, .ca, etc.) is used.
How Does IDNA Work?
Much of the web’s infrastructure supports only the ASCII character set. To make sure internationalized domain names can actually be processed, each IDN written in Unicode is translated into an ASCII Compatible Encoding (ACE) string, which uses only ASCII characters. Website addresses featuring characters with accents or umlauts are displayed to the user in that form, but servers continue to process them in their ASCII-compatible form. These processes are specified in IDNA2003 and IDNA2008. The translation from Unicode to ASCII occurs client-side and is based on a standardized encoding called Punycode.
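As a rough illustration of this client-side translation, the sketch below uses the idna codec that ships with Python's standard library, which follows the IDNA2003 rules. The domain bücher.example and the helper names to_ace and to_unicode are made up for this example and are not part of any standard.

```python
# Sketch only: Python's built-in "idna" codec follows IDNA2003 (RFC 3490).
# The example domain and the helper names are illustrative assumptions.

def to_ace(domain: str) -> str:
    """Translate a Unicode domain into its ASCII-compatible (ACE) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(ace_domain: str) -> str:
    """Translate an ACE domain back into the Unicode form shown to users."""
    return ace_domain.encode("ascii").decode("idna")


ace = to_ace("bücher.example")
print(ace)              # xn--bcher-kva.example
print(to_unicode(ace))  # bücher.example
```

A domain that is already plain ASCII passes through to_ace unchanged, which is why existing addresses were not affected by the introduction of IDNs.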
Punycode – Standardized in RFC 3492, Punycode was developed to represent Unicode character strings unambiguously and reversibly using only ASCII symbols. All non-ASCII characters are removed from the domain name, encoded, and appended after a hyphen. This code sequence contains information about each Unicode symbol in question as well as its position in the domain name. Additionally, each ACE string created in this way is given the prefix xn-- at the beginning, which signals that the character sequence is an IDN encoded according to the IDNA and Punycode standards. Let’s look at a quick example:
IDN Form: müllers-café.com
ACE String: xn--mllers-caf-k7a2t.com
The prefix xn--, which labels the domain as an ACE string, is followed by the part of the domain name that remains once all non-ASCII characters have been stripped out, mllers-caf. The encoded special characters, k7a2t, are then appended to the end of the name and separated from it by a hyphen.
Verisign’s IDN conversion tool makes it easy to translate from ACE into IDN or vice versa.
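The same conversion can be reproduced locally. The following is a minimal sketch using Python's built-in punycode codec on the single label from the example above; the variable names are chosen here purely for illustration.

```python
# Sketch only: encode one domain label with Python's built-in "punycode" codec.
label = "müllers-café"

# Raw Punycode output: the ASCII characters, a hyphen, then the encoded part.
encoded = label.encode("punycode").decode("ascii")

# The last hyphen separates the plain ASCII part from the encoded suffix.
basic_part, _, encoded_suffix = encoded.rpartition("-")

# IDNA adds the ACE prefix "xn--" in front of the Punycode string.
ace_label = "xn--" + encoded

print(basic_part)      # mllers-caf
print(encoded_suffix)  # k7a2t
print(ace_label)       # xn--mllers-caf-k7a2t

# Decoding reverses the process: drop the prefix, then decode the Punycode.
restored = ace_label[len("xn--"):].encode("ascii").decode("punycode")
print(restored)        # müllers-café
```

Note that the punycode codec only handles a single label and does not add the prefix itself, so the top-level domain (.com) is left out here and would simply be appended unchanged.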
Differences Between IDNA2003 and IDNA2008
In the original procedure (back in 2003), internationalized domains were normalized prior to Punycode encoding using the Nameprep method. This method converted capital letters into lowercase letters, removed control characters, and mapped equivalent characters to a unified form. Nameprep was dropped from the process with the introduction of IDNA2008. IDNA2008 itself does not specify any normalization of what the user types. Instead, it leaves that step to the application and recommends a mapping that, among other things, converts capital letters into lowercase ones.
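The effect of the IDNA2003 normalization step can be observed with Python's built-in idna codec, which applies Nameprep before Punycode encoding. The sketch below is only an illustration under that assumption: the helper name to_ace and the sample spellings are invented for this example, and a library implementing IDNA2008 may reject or treat these inputs differently.

```python
import unicodedata

# Sketch only: Python's built-in "idna" codec applies Nameprep (IDNA2003).
# The helper name and sample inputs are illustrative assumptions.

def to_ace(domain: str) -> str:
    """Convert a Unicode domain to its ACE form using IDNA2003 rules."""
    return domain.encode("idna").decode("ascii")


lowercase = "müllers-café.com"
uppercase = "MÜLLERS-CAFÉ.com"

# Same visible text, but written as base letters plus combining accents.
decomposed = unicodedata.normalize("NFD", lowercase)

# Nameprep lowercases the label and unifies equivalent character
# sequences, so all three spellings end up as the same ACE string.
for spelling in (lowercase, uppercase, decomposed):
    print(to_ace(spelling))  # xn--mllers-caf-k7a2t.com
```

Because IDNA2008 leaves this preparation to the application, code targeting the newer standard would typically lowercase and normalize the input itself before conversion.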