- you follow my blog posts
- there is a bug in Google's page rank mechanism for some search terms and you landed here. Was the search term 'encoding' or 'multi-bytes' or 'xml' or 'inputstream'?
This post's primary aim is to ensure that some knowledge sinks into my brain. I have this undesirable trait of forgetting solutions to problems I have faced and fixed earlier. The solution, and at times the problem too, seems to escape from my brain after a short span of time (volatile RAM inside). So this entry is to etch the learning in my memory. If not, it would at least serve as a ready reckoner for looking up solutions to problems that I've seen in the past.
So what's the big deal with multi-byte characters? Seriously, I don't know. All I know is that there are quite a few character encodings out there. They cropped up to fix some problem or other with the characters of the plethora of languages in use today. An important piece of info that I had picked up somewhere was the existence of a character encoding standard called 'UTF-8', which is widely recommended because it can represent the characters of practically all of those languages.
Here is the problem:
There is an HTTP service that needs to be invoked. You are given a URL (with the required parameters on the query string) to invoke the service. The response will be an XML stream, which will be well-formed and valid. The XML needs to be parsed, and some Java objects that already exist in the system need to be populated with the parsed XML data.
Now the solution that was adopted:
Used Apache commons-httpclient library's HttpPost and HttpClient to post the HTTP request to the specified endpoint. You could use Sun's URLConnection instead if you fancy that. A StAX parser would have been fine for the job, but the output XML had to be physically saved for a cron job, which would process the XML file later. The InputStream was wrapped inside a BufferedInputStream, as is normally recommended. Data was read from the buffered stream and written to a FileOutputStream via an OutputStreamWriter, using its write(int) method. Apache's Digester library was used later, in the cron job, to parse the XML and populate the Java objects. Here is the code snippet which does what was described above.
    InputStream in = [API Call](url, params);
    BufferedInputStream bis = new BufferedInputStream(in);
    FileOutputStream fos = new FileOutputStream(filename);
    OutputStreamWriter outWriter = new OutputStreamWriter(fos, "UTF-8");
    int numRead;
    while ((numRead = bis.read()) != -1) {
        outWriter.write(numRead);
    }
There was nothing wrong with the approach, but because of one implementation detail I had overlooked, it did not work properly for multi-byte characters. The service's XML output explicitly declared its encoding as 'UTF-8', so when the service was accessed via the browser the data was rendered properly by the user agent. But when I opened the XML file on my disk in the same browser, some characters came up garbled. I inferred that something was going wrong while writing the XML from the response stream into the file. I remembered running into this same problem a few years back, but could not recollect what I had done then to fix it.
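The symptom can be reproduced without the service at all. Here is a minimal sketch, assuming an in-memory byte array stands in for the response stream and a ByteArrayOutputStream stands in for the file; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class GarbledCopyDemo {

    // Copies the same way as the snippet above: one read() in, one write(int) out.
    // The byte arrays are stand-ins (assumptions) for the HTTP response and the file.
    static byte[] copyByteByByte(byte[] source) {
        InputStream in = new ByteArrayInputStream(source);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            int b;
            while ((b = in.read()) != -1) {
                // Writer.write(int) takes the int as a character code, not as a raw byte.
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes C3 A9 in UTF-8.
        byte[] original = "\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] copied = copyByteByByte(original);
        System.out.println(original.length + " bytes in, " + copied.length + " bytes out");
        // prints "2 bytes in, 4 bytes out"
    }
}
```

The copy is twice as long as the original, and decoding it as UTF-8 gives two characters (U+00C3 and U+00A9) where there used to be one, which is the kind of garbling the browser showed.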
The encoding on both streams was rightly 'UTF-8', so it was puzzling that the code would not work for multi-byte strings. I was reading from a properly encoded stream via the read() method and writing that byte out to another properly encoded stream. It dawned on me to investigate the implementation of BufferedInputStream's read() method. The javadoc screamed that the method reads just one byte from the stream and returns it as an int between 0 and 255. Bingo! That was the cause of the problem whenever a single character came in as a two-byte or longer sequence. Each time I wrote out a byte I was handing the writer only a part of that multi-byte character, and the writer's write(int) method treats its argument as a complete character, so every fragment got encoded again as a character of its own.
Having identified the root cause, it was easier to fix. I simply chunked the reading to consider many bytes at a time. I thought about my chunk size for a while. What value should I use? With some googling it seemed that a UTF-8 character can be between 1 and 4 bytes long. So 1*2*3*4 = 24 should be a good byte array size to start with, and any multiple of 24 should be fine, I supposed. I set my chunking limit accordingly, converted each byte[] into a properly encoded String and sent that String to the writer to spit out the XML. Voila! It worked.
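The 1 to 4 byte range is easy to check for yourself. A small sketch, with sample characters that are my own picks rather than anything from the service's data:

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {

    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // U+0061, prints 1
        System.out.println(utf8Length("\u00e9"));       // U+00E9, prints 2
        System.out.println(utf8Length("\u20ac"));       // U+20AC (euro sign), prints 3
        System.out.println(utf8Length("\ud83d\ude00")); // U+1F600 (a surrogate pair in Java), prints 4
    }
}
```

Characters of different lengths are freely mixed in real text, so a character does not have to start at any particular byte offset.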
But I was not very convinced that the solution was exact. What if my chunk is such that a character with multibyte representation gets split into more than one chunk? Some more searching on the internet and I realized that while dealing with character streams, a BufferedReader is the most suitable class for the job. Streams are meant for binary data.
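That worry is justified. Here is a minimal sketch of the chunked approach, assuming each chunk is decoded on its own with new String(bytes, "UTF-8") as described above; the names and the sample text are hypothetical. With a 24-byte chunk, 23 ASCII characters followed by one two-byte character is enough to split that character across two chunks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ChunkSplitDemo {

    // Reads fixed-size chunks and decodes every chunk independently.
    static String decodeChunkByChunk(byte[] source, int chunkSize) {
        InputStream in = new ByteArrayInputStream(source);
        StringBuilder text = new StringBuilder();
        byte[] chunk = new byte[chunkSize];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                // A character cut in half at the chunk boundary cannot be decoded here.
                text.append(new String(chunk, 0, n, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // 23 one-byte characters, then U+00E9 whose two bytes land at offsets 23 and 24.
        String original = "aaaaaaaaaaaaaaaaaaaaaaa\u00e9";
        String decoded = decodeChunkByChunk(original.getBytes(StandardCharsets.UTF_8), 24);
        System.out.println(decoded.equals(original));       // prints false
        System.out.println(decoded.indexOf('\uFFFD') >= 0); // prints true
    }
}
```

The halves of the split character come out as U+FFFD replacement characters, so a multiple of 24 only works when the data happens to line up with the chunk boundaries.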
If only I had remembered this solution a few hours earlier, this post might not have fructified at all!
Anyways the modified code snippet that used a reader is below:
    InputStream in = [API Call](url, params);
    FileOutputStream fos = new FileOutputStream(filename);
    OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
    BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    String fileline;
    while ((fileline = br.readLine()) != null) {
        osw.write(fileline);
        osw.write('\n'); // readLine() strips the line terminator, so put one back
    }
    osw.close(); // flushes whatever the writer is still buffering into the file
    br.close();

I wonder how the reader classes have the intelligence to read character by character for a given encoding. Will definitely look that up in the JDK sometime ... hopefully ;-)
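On that last question: as far as I can tell, InputStreamReader hands the bytes to a charset decoder and holds on to the trailing bytes of an incomplete character until the rest of it arrives. The sketch below tries that out under the worst case I could think of, a made-up stream that never returns more than one byte per call. The class and method names are hypothetical, and copying through a char[] buffer instead of readLine() is my own variation, chosen because it keeps the line terminators exactly as they were.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderCopyDemo {

    // Hypothetical worst-case source: hands over at most one byte per read call,
    // so every multi-byte character arrives in pieces.
    static InputStream oneByteAtATime(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, 1));
            }
        };
    }

    // Decodes the whole stream as UTF-8 through a Reader, copying char buffers.
    static String readAll(InputStream in) {
        StringWriter out = new StringWriter();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.write(buffer, 0, n); // line terminators pass through untouched
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Characters of 1, 2, 1, 3 and 4 bytes respectively.
        String original = "a\u00e9\n\u20ac\ud83d\ude00";
        String decoded = readAll(oneByteAtATime(original.getBytes(StandardCharsets.UTF_8)));
        System.out.println(decoded.equals(original)); // prints true
    }
}
```

So the reader does not care where the underlying stream cuts the bytes, which is exactly what the hand-rolled chunking could not promise.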