Tuesday, March 10, 2009

Unicode Support in Web applications

UPDATE: While most of the points below are still true as of March 2014, the one thing you should do is review your whole stack for Unicode support. For example, search the Tomcat documentation for its character-encoding settings.

Regardless of which programming language you use and which application server is involved, internationalization (i18n) support is about handling character encoding properly. Below are guidelines to take into account for any combination of web server and language; the examples are shown for JBoss/Java though.

1. REQUEST ENCODING: Do not touch it on the server side. Do not use request.setCharacterEncoding() unless you understand the implications. Only the client (browser) knows which encoding it used and it should be the only one responsible for setting it. Most browsers use "ISO-8859-1" when submitting data, but the encoding can differ, for example when the HTML form carries an accept-charset attribute. JBoss (4.0.4-GA at least) will not affect a GET request when request.setCharacterEncoding() is used, but it will affect the POST parameters, so clearly the algorithm used to decode the query string is different from the one used for posted form data. By default, at least in my tests, request.getCharacterEncoding() always returns null.
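To illustrate the accept-charset attribute mentioned above, here is a minimal, hypothetical form (the action URL and field name are made up) that asks the browser to submit its data as UTF-8:
<form action="/search" method="post" accept-charset="UTF-8">
    <input type="text" name="q"/>
    <input type="submit" value="Search"/>
</form>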

2. REQUEST PARAMETERS PARSING: Browsers will submit data as "ISO-8859-1" even when it contains characters that encoding cannot represent. Be aware of your server's limitations: a typical Servlet container will interpret any request as "ISO-8859-1", so corrupted bytes end up in the String returned by ServletRequest.getParameter(). The good news is that the bytes can be read back from that corrupted string and re-decoded with the right encoding (the one you support in your application, which should be UTF-8 99% of the time). To correct this issue, wrap ServletRequest.getParameter() following the advice in http://www.adobe.com/devnet/server_archive/articles/internationalization_in_jrun.html. Notice however that the post assumes browsers always use "ISO-8859-1". A safer way is to find the encoding as below:
String requestEncoding = request.getCharacterEncoding();
if (requestEncoding == null || requestEncoding.trim().length() == 0) {
    requestEncoding = "ISO-8859-1";
}
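Building on the requestEncoding variable above, the corrupted value returned by getParameter() can then be re-decoded. A minimal sketch (the parameter name "q" is made up and UnsupportedEncodingException handling is omitted):
String raw = request.getParameter("q");
// Recover the bytes the container decoded with requestEncoding and
// decode them again as UTF-8, the encoding the application supports.
String value = (raw == null)
        ? null
        : new String(raw.getBytes(requestEncoding), "UTF-8");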


If you are using Tomcat you can avoid all of that with a change in server.xml plus the SetCharacterEncodingFilter shown below; equivalent features should be available in other servers as well. The connector attribute below deals with GET requests:
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" URIEncoding="UTF-8"/>
And the filter below deals with POST requests:
<!-- The first filter must be the Set Character Encoding Filter -->
<filter>
    <filter-name>setCharacterFilter</filter-name>
    <filter-class>org.apache.catalina.filters.SetCharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>ignore</param-name>
        <param-value>false</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>setCharacterFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

3. PROGRAM MANIPULATION: Java has Unicode support out of the box, but not all classes are privileged to enjoy it. Be careful when working with java.util.Properties: properties files loaded with Properties#load(InputStream) are read as ISO-8859-1, so raw UTF-8 is not allowed. If UTF-8 support is needed, the characters must be escaped in the properties files, which means converting UTF-8 strings into ASCII Java escaped Unicode (this is what the JDK's native2ascii tool does). For example the two-character Chinese string "中医" (URL-encoded version: %E4%B8%AD%E5%8C%BB) should be converted to \u4e2d\u533b before using Properties#load(). There is a project with source code available that can be used to work around
this issue (https://advancenative2asciitool.dev.java.net/). Below is a class that turns Unicode characters into ASCII Java escaped Unicode:
/**
* All credits for the "Advance Native2ASCII Tool" from https://advancenative2asciitool.dev.java.net/
*/

package com.bittime.util.design;

/**
* Please refer to http://en.wikipedia.org/wiki/ASCII for definition on valid/printable ASCII characters
*/
public class Unicode2ASCII {

    /** Escapes every character above printable ASCII (> 126) as an HTML numeric entity. */
    public static String toHTML(String unicode)
    {
        String output = "";
        char[] ca = unicode.toCharArray();

        for (int x = 0; x < ca.length; x++) {
            char a = ca[x];
            if ((int) a > 126) {
                output += "&#" + (int) a + ";";
            } else {
                output += a;
            }
        }

        return output;
    }

    /** Escapes every character above the Latin-1 range (> 255) as a Java \uXXXX escape. */
    public static String toJAVA(String unicode)
    {
        String output = "";
        char[] ca = unicode.toCharArray();

        for (int x = 0; x < ca.length; x++) {
            char a = ca[x];
            if ((int) a > 255) {
                String hexString = "0000" + Integer.toHexString((int) a);
                hexString = hexString.substring(hexString.length() - 4);
                output += "\\u" + hexString;
            } else {
                output += a;
            }
        }

        return output;
    }
}
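For example, feeding the two characters from the example above through the class should give the escaped forms (a quick sketch; only the class and method names come from the code above, the rest is illustration):
import com.bittime.util.design.Unicode2ASCII;

public class Unicode2ASCIIExample {
    public static void main(String[] args) {
        String chineseMedicine = "\u4e2d\u533b"; // "中医"
        System.out.println(Unicode2ASCII.toJAVA(chineseMedicine)); // prints \u4e2d\u533b
        System.out.println(Unicode2ASCII.toHTML(chineseMedicine)); // prints &#20013;&#21307;
    }
}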


4. RESPONSE HEADERS: The Content-Type HTTP header is responsible for telling the client (browser) the encoding used in the response. A typical example is "Content-Type: text/html; charset=UTF-8". The response content type must therefore be set not only with the MIME type you want; for any text MIME type the proper charset must be specified as well.

From JSP for example:
<%@page contentType="text/html; charset=UTF-8" %>
From Java:
response.setContentType("text/html; charset=UTF-8");

There is no need to call response.setCharacterEncoding() if setContentType() is called with both the MIME type (text/html) and the encoding (charset). If response.setCharacterEncoding("UTF-8") is used then setContentType("text/html") must still be called. Refer to http://java.sun.com/javaee/5/docs/api/javax/servlet/ServletResponse.html#setCharacterEncoding(java.lang.String) for more explanation.

If you can identify in a Servlet Filter that the request should be served as HTML, then it is a good idea to build a Unicode filter that forces the response to UTF-8 for those requests:
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class UnicodeFilter implements Filter {

    public void doFilter(ServletRequest request, ServletResponse response,
            FilterChain chain) throws IOException, ServletException {
        // Force every response passing through this filter to be served as UTF-8 HTML
        response.setContentType("text/html; charset=UTF-8");
        chain.doFilter(request, response);
    }

    public void init(FilterConfig config) throws ServletException {
    }

    public void destroy() {
    }
}
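As with any filter it has to be registered in web.xml; a sketch (the filter name, package, and URL pattern below are assumptions, adjust them to your application):
<filter>
    <filter-name>unicodeFilter</filter-name>
    <!-- the package is hypothetical; use the one your UnicodeFilter lives in -->
    <filter-class>com.example.web.UnicodeFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>unicodeFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>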

5. RESPONSE CONTENT: Of course you must ensure the whole body of the HTTP response uses the encoding specified by the RESPONSE HEADERS. This is not a big problem in Java, but again there are subtle issues you need to be aware of. It is common, for example, to use JSP to specify markup. If the HTML included in the JSP needs non-ASCII Unicode characters then the encoding of the file must be the same as the one used in the responses AND, as explained before, a @page directive with contentType specified is required in absolutely all JSP pages. Add to the equation any file touched by build tools, regardless of whether you use Maven, Ant, Make, shell, manual, or you-name-it methods (a Maven example follows below). Any piece of information returned after the HTTP headers must be encoded as dictated by the "Content-Type" header, so my advice is to encode all your source files in UTF-8 if you need to support internationalization in your application.

Finally, be sure your server is running with the default encoding set to UTF-8. Commonly this is not a problem in current versions of Linux, where the default charset is precisely "UTF-8", but Windows defaults to a legacy encoding (typically Cp1252/ISO-8859-1). In the case of Java applications use the below when starting the server:
-Dfile.encoding=UTF-8
For JBoss on Windows, for example:
set JAVA_OPTS=-Dfile.encoding=UTF-8 %JAVA_OPTS%

You can always check from JSP the server encoding:
<%=System.getProperty("file.encoding")%>
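And since, as mentioned above, build tools touch these files too, it helps to pin the source encoding there as well. With Maven, for example, a common property in pom.xml (a sketch, assuming a Maven build):
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>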

6. DEFAULT RESPONSE ENCODING: There are server specific configurations that will free the developer from setting the charset of the response. I do not recommend them since they break portability and might end up generating other problems related to internationalization.

7. BROWSER / OS LANGUAGE SUPPORT: Firefox, for example, comes with support for all languages, but Internet Explorer does not. Commonly, Windows XP users running IE will need to install language support for Chinese and Thai. Font support in IE is tricky (Internet Options | General | Fonts | Latin based; Times New Roman or Arial plus Courier New is a good selection, anything else is not good for Unicode support), and support for Asian languages must be selected manually (Control Panel | Regional and Language Options | Install files for East Asian languages).

8. FILE ENCODING: If developers use both Windows and Linux then the default encoding in their editors will be an issue (Linux uses UTF-8). Do not use non-ASCII characters in code if you are not able to force the use of UTF-8 in the editors. If specific non-ASCII text is needed, for String comparison for example, you can use escaped Unicode sequences in Java: instead of ÀÁÂÃÄ use \u00c0\u00c1\u00c2\u00c3\u00c4.
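A quick sketch of that comparison (the variable and parameter name are made up):
String name = request.getParameter("name"); // hypothetical source of the value
// Same as "ÀÁÂÃÄ".equals(name) but without non-ASCII bytes in the source file
boolean match = "\u00c0\u00c1\u00c2\u00c3\u00c4".equals(name);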

Here is an excellent Unicode table.

9. JDBC: If you are interacting with a database that is already storing UTF-8 successfully, then it is time to tell your driver to use Unicode as well. Here is a typical URL for MySQL:
jdbc.url=jdbc:mysql://localhost:3306/myDB?useUnicode=true&characterEncoding=UTF-8
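A minimal sketch of opening a connection with that URL (the driver class is the old Connector/J name; the credentials are placeholders):
// uses java.sql.Connection and java.sql.DriverManager
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/myDB?useUnicode=true&characterEncoding=UTF-8",
        "user", "password");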


10. FRAMEWORKS: Look for Unicode support if you are using a specific framework. For example, in Spring, org.springframework.web.filter.CharacterEncodingFilter will set the request encoding and, optionally, the response encoding.
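A sketch of wiring that Spring filter in web.xml (the filter name and URL pattern are up to you):
<filter>
    <filter-name>characterEncodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>characterEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>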
