Friday, August 27, 2010

UTF-8 and other pleasures

When I played with concordle two years ago, the world was only slightly UTF-ized, so concordle works with text in which A is 65 and B is 66, while @ is 64, and so on. The unwanted signs, such as quotes and commas, can all be listed easily, since there are relatively few of them. So when you copied a text from a page into concordle, usually all went well. Now the world is UTF-8 and perhaps even threatening to become UTF-16-ized, I do not really know.
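To make the "A is 65" world concrete, here is a minimal Python sketch. It is not concordle's actual code: the UNWANTED list and the strip_unwanted function are made up for illustration, on the assumption that the unwanted signs are simply listed and removed.

```python
# Hypothetical list of unwanted signs; concordle's real list is not shown here.
UNWANTED = '"\',.;:!?()'


def strip_unwanted(text: str) -> str:
    """Remove every character that appears in UNWANTED."""
    # The three-argument form of str.maketrans maps the listed characters to "delete".
    return text.translate(str.maketrans("", "", UNWANTED))


# In ASCII each character is one small number.
print(ord("@"), ord("A"), ord("B"))
print(strip_unwanted('"Hello, world!"'))
```

A short list like this works as long as every quote or comma in the text is one of the few ASCII ones.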

Warning: as things stand now, concordle can do strange things with a new text, because the text might contain an encoding that upsets concordle's simple mind, which knows only ASCII-like text with a little bit of some ISO.

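What is UTF-8? UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode (yes: en.wikipedia.org/wiki/UTF-8). In practice this means that plain ASCII characters still take one byte each, while every other character takes two to four bytes. So I will need to do some more work on this soon!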
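What "variable-length" means is perhaps easiest to see by counting bytes. The sketch below is only an illustration in Python; the helper name utf8_length is made up, and the sample characters are written as escapes so that the example does not itself depend on how this page is encoded.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# "A", e with acute accent, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    # Show the code point, the byte count, and the raw bytes in hex.
    print(hex(ord(ch)), utf8_length(ch), ch.encode("utf-8").hex())
```

The plain A stays a single byte with the value 65, which is why old ASCII text is also valid UTF-8. The other three need two, three and four bytes.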

For now: if you want to use concordle a bit seriously, be sure that the entered text does not contain "variable length encoding", that is, characters outside plain ASCII. These can be quotes (why are quotes not just a simple thing? Do not ask me!) or even a hyphen or minus (hyphen and minus are not the same thing, you see!). I will need to dive into Unicode and its 8-bit cousin, or incarnation.
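One possible way to check a text before pasting it in is sketched below in Python. This is not part of concordle: the names REPLACEMENTS, find_non_ascii and to_plain_ascii_punctuation are hypothetical, and the replacement table covers only a few common typographic quotes and dashes, so it is a starting point rather than a complete cleanup.

```python
import unicodedata

# Assumed, incomplete table: typographic quotes and dashes mapped to ASCII look-alikes.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # left/right single quotation mark
    "\u201c": '"', "\u201d": '"',   # left/right double quotation mark
    "\u2010": "-", "\u2013": "-",   # hyphen, en dash
    "\u2014": "-", "\u2212": "-",   # em dash, minus sign
}


def find_non_ascii(text: str) -> list:
    """List (position, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


def to_plain_ascii_punctuation(text: str) -> str:
    """Replace the characters listed in REPLACEMENTS; leave everything else alone."""
    return text.translate(str.maketrans(REPLACEMENTS))


sample = "\u201cwell\u2013known\u201d"
print(find_non_ascii(sample))
print(to_plain_ascii_punctuation(sample))
```

Anything that find_non_ascii still reports after the replacement, such as accented letters, would need a decision of its own. Silently dropping those characters would change the word counts.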
