Internationalisation and character sets can be a pain in the ass. I avoided them as much as I could until today, because I reckoned Java always runs UTF-16 and that it shouldn’t be a problem. 😉 Java strings are indeed always UTF-16 internally, but the transition from web requests into the JVM can be a bit tricky.
We need to accept special characters for all of Europe, and we got tons of errors when we finally started testing with Greek characters. I found a couple of articles describing the problem that helped me solve it:
- Developing J2EE Global Applications : Character Encoding
- Developing Multilingual Web Applications Using JavaServer Pages Technology
It surprised me a bit that the web client does not specify which encoding it submits back in. If the page that was received from the server, and is being submitted, is in UTF-8, the client submits back UTF-8. But because it doesn’t specify the encoding on submit, the web server must know or remember which encoding the pages were sent out in. This is not done by default in J2EE, so the encoding on the request is null and the container falls back to its default (ISO-8859-1 according to the servlet spec) when it decodes the parameters. Because we run UTF-8 on all our pages (which is recommended practice), we just set the encoding in a filter for each request. If you have different users that might use different encodings, you’ll have to keep track of it in the session or something like that.
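To make the problem concrete, here is a minimal sketch in plain Java with no servlet container involved. It percent-encodes a Greek word the way a browser would for a UTF-8 page, then decodes it once as UTF-8 and once as ISO-8859-1. The class name `FormEncodingDemo` and the helper `resolveEncoding` are made up for this illustration and are not our actual filter; `resolveEncoding` only mimics the decision the filter makes when the request reports no encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormEncodingDemo {

    // Percent-encode a value the way a browser would when submitting
    // a form from a page served in the given charset.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Decode a submitted value using the charset the server assumes.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Hypothetical helper: if the request declares no encoding (null),
    // assume UTF-8 because that is what all our pages are served in.
    static String resolveEncoding(String declared) {
        return declared != null ? declared : "UTF-8";
    }

    public static void main(String[] args) {
        String greek = "\u03BA\u03B1\u03BB\u03B7\u03BC\u03AD\u03C1\u03B1"; // "καλημέρα"
        String submitted = encode(greek, "UTF-8");

        // Server assumes ISO-8859-1: every UTF-8 byte becomes its own character.
        String garbled = decode(submitted, "ISO-8859-1");

        // Server falls back to UTF-8 when nothing is declared: round trip works.
        String correct = decode(submitted, resolveEncoding(null));

        System.out.println("submitted: " + submitted);
        System.out.println("ISO-8859-1 matches original: " + garbled.equals(greek));
        System.out.println("UTF-8 matches original:      " + correct.equals(greek));
    }
}
```

In a real filter, the equivalent of `resolveEncoding` would be to check `request.getCharacterEncoding()` and, if it is null, call `request.setCharacterEncoding("UTF-8")` before anything reads the request parameters. Once a parameter has been read, the container has already decoded the body and it is too late to change the encoding.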