It's important to support a defined range of possible characters in filenames, because it is up to the users to name their image files.
This is a rather hairy topic, because C++ is not good at handling characters bigger than a byte and there are major differences between platforms (Windows and Linux).
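- The C and C++ standards only guarantee that a char has at least 8 bits and that the wide character type wchar_t is large enough to hold every character of the largest character set the implementation supports. In practice wchar_t is 16 bits with Microsoft's compilers and 32 bits with gcc on Linux. With gcc a wchar_t can store one UTF-32 character, and UTF-32 is constant length. On Windows a wchar_t holds only one UTF-16 code unit, and because UTF-16 is variable length, characters outside the first 16 bits need two of them. But note that neither char nor wchar_t says anything about the encoding.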
- Modern Linux systems normally use UTF-8. UTF-8 is ASCII compatible (in the lower 7 bits) and uses a variable length of up to 4 bytes per character. Whether it is used depends on the locale, which is normally set to something like de_DE.utf8. With such a setting all char strings are interpreted as UTF-8, so char and std::string (instead of std::wstring) are well suited to storing any Unicode character. But byte-counting functions like strlen or std::string::length() return the number of bytes, not the number of characters. Filenames are conventionally UTF-8 as well, so sticking with char is fine and, for mostly ASCII text, consumes only about a quarter of the memory of a wstring.
- It's important to set the locale in your program. It defaults to "C", which only supports ASCII. In C++, std::locale::global(std::locale("")) sets the locale to the one from the system environment. Note that the C function setlocale() doesn't affect C++ streams, whereas changing std::locale::global also calls setlocale. To apply a specific locale to a stream, use mystream.imbue(mylocale).
- Newer versions of Windows use 16-bit wchar_t strings internally (because they have to deal with lots of legacy software), so using char doesn't save memory there. If you use char anyway, how the Windows API converts char strings depends on the localized version of Windows you have.
- NTFS stores filenames (according to http://en.wikipedia.org/wiki/NTFS) as 16-bit words and doesn't care about the encoding. Most Windows API functions exist in two variants, one taking char and one taking wchar_t, and a macro normally hides which one is called. If you use the char variant, the current code page is used.
- Unicode's most important characters are in the first 16 bits, called the BMP (Basic Multilingual Plane). It holds virtually all characters in common use. If we can handle all of these, that should be more than enough.
- The filename tester tries creating files systematically.
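The remarks above about byte counts and the BMP can be illustrated with a small sketch. The helper names countCodePoints and isBmpOnly are hypothetical and not part of the Hugin sources, and both functions assume their input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx and are skipped, so only
// the first byte of each character is counted. Assumes valid UTF-8 input.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: returns true if every character lies in the BMP.
// In valid UTF-8, code points above U+FFFF are exactly those encoded with
// 4 bytes, and their lead byte has the bit pattern 11110xxx.
bool isBmpOnly(const std::string& utf8)
{
    for (unsigned char c : utf8) {
        if ((c & 0xF8) == 0xF0) {
            return false;
        }
    }
    return true;
}
```

For a UTF-8 filename containing one two-byte character, such as a German umlaut, std::string::length() is one larger than countCodePoints(). This is the discrepancy mentioned for strlen above.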
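How such a systematic test could work is sketched below. This is not the actual filename tester. It only illustrates the idea of building a UTF-8 filename for a given code point and trying to create a file with it. Both function names are hypothetical. encodeUtf8 assumes a valid code point and does not reject surrogates or values above U+10FFFF.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Assumes cp is a valid code point; no range or surrogate checks are done.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: tries to create a file with the given name and
// removes it again. A true result only shows that opening succeeded, not
// that the name stored on disk matches the requested one.
bool canCreateFile(const std::string& name)
{
    {
        std::ofstream file(name.c_str());
        if (!file) {
            return false;
        }
    } // the stream is closed here, before the file is removed
    std::remove(name.c_str());
    return true;
}
```

A test loop could then call canCreateFile("test_" + encodeUtf8(cp) + ".tmp") for every code point of interest in the BMP. On Windows the char-based file API goes through the current code page, as described above, so this sketch is mainly meaningful on Linux.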
- Florian Achleitner <firstname.lastname@example.org>
Generated on Wed Jul 30 01:25:54 2014 for Hugintrunk by