[unixODBC-support] How driver manager converts between Unicode and ANSI

Daniel Vogelbacher daniel at vogelbacher.name
Thu Apr 11 22:19:17 BST 2013


On [Thu, 11.04.2013 21:46], Nick Gorham wrote:
> On 11/04/13 17:32, Daniel Vogelbacher wrote:
> > Hi,
> >
> > as far as I understand the official ODBC spec, the DM must convert
> > between wide-strings and ansi-strings. On Windows this is done by
> > converting from unicode to the current code page (locale setting) and
> > vice versa.
> > For example, if I use an ANSI-only driver and call SQLExecDirectW(),
> > the unicode string gets converted to my local code page (iso8859-1 or
> > something else) and passed to the driver's SQLExecDirect().
> >
> > In the real world, I discovered two issues:
> >   1.) Most drivers' ANSI functions expect strings not in the code page
> >   encoding, but in a driver-specific encoding, for example a character
> >   set specified inside the DSN (like CharSet=utf8).
> >   If a user loads an ANSI-only driver which expects strings in encoding
> >   XY, how does the DM know about that in order to perform the correct
> >   conversion between unicode and XY? (this is more a Windows issue, but
> >   related to the next issue)
> >
> >   2.) The DM from unixODBC seems to do something very odd when
> >   converting between unicode and ansi. I expected it to use
> >   mbstowcs() & co. for conversion according to the locale setting
> >   (en_US.utf8 or something else).
> >   But after a lot of tests and a final look into the code, I discovered
> >   that the DM just chooses iso8859-1... ?!
> >   This breaks the combination of the wide API on the application side
> >   and an ANSI-only driver (like sqliteodbc) which expects UTF-8 strings.
> >   Is this really intended?
> >
> >   But even if the DM used the locale information (as I expected),
> >   issue no. 1 would remain for drivers which expect a specific
> >   charset (like the sqlite odbc driver).
> >
> >
> > I hope someone could help me with this. It's very confusing.
> 
> TBH, your questions reflect the confusion and compromises that are
> involved. The default is 8859, but as you say Windows does much the
> same. You can specify other iconv targets when you configure, but I had
> to pick something for a default.
> 
> It can't use mbstowcs as sizeof( SQLWCHAR ) != sizeof( wchar_t ).

Well, you can convert SQLWCHAR strings to wchar_t strings, but this
does not solve any problems ;-)
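
To illustrate the SQLWCHAR to wchar_t conversion mentioned above, here
is a minimal sketch in Python (chosen only for brevity). It assumes
SQLWCHAR is a 2-byte unit carrying UTF-16 and that the target holds one
full code point per element, as a 4-byte wchar_t typically does on
Linux. The function name utf16_units_to_str is hypothetical and not
part of any ODBC API.

```python
def utf16_units_to_str(units):
    """Combine 16-bit code units (assumed UTF-16) into full code points."""
    chars = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A surrogate pair encodes one code point above U+FFFF.
            lo = units[i + 1]
            chars.append(chr(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: substitute the replacement character.
            chars.append("\ufffd")
            i += 1
        else:
            # Code units outside the surrogate range map 1:1.
            chars.append(chr(u))
            i += 1
    return "".join(chars)


# Four 16-bit units become three characters.
print(len(utf16_units_to_str([0x48, 0xE4, 0xD83D, 0xDE00])))
```

The conversion itself is mechanical; it says nothing about which
encoding an ANSI-only driver expects, which is the actual problem.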

> But you finally point out the real problem: it doesn't matter what the
> driver manager uses, as the driver may ignore all that and do something
> else. And of course, the driver manager can only convert those bits it
> has access to; call SQLGetData( SQL_C_CHAR ) on a unicode column and
> the driver manager has no say in what happens.
> 
> And then there are the multibyte sequences like UTF8. The Easysoft
> drivers have options to use UTF8, and so do others, but unlike other
> DMs unixODBC doesn't treat them as WCHAR types; there is no point, and
> it would contradict XOpen if it did. But you still have the problem of
> what to do with partial reads breaking a character sequence.
> 
> As you say it's confusing, but I don't know of any way of simplifying it
> without breaking something or losing something that someone needs.
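
The partial-read problem can be sketched as follows. This is only an
illustration, assuming the application receives UTF-8 bytes in
arbitrary chunks (for example from repeated reads of a long column);
decode_chunks is a hypothetical helper, and the chunk boundary below is
chosen to fall in the middle of a two-byte character.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks whose boundaries may split a multibyte character."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for chunk in chunks:
        # The decoder buffers an incomplete trailing sequence until
        # the next chunk supplies the remaining bytes.
        parts.append(decoder.decode(chunk))
    # final=True raises if a truncated sequence is still pending.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "Gr\u00fc\u00dfe".encode("utf-8")
chunks = [data[:3], data[3:]]  # the split lands inside the bytes of U+00FC
print(decode_chunks(chunks) == "Gr\u00fc\u00dfe")
```

Decoding each chunk on its own would fail on the first chunk, because
it ends with an incomplete sequence.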

What is the recommended way for application programmers (where the
application uses wchar_t internally) to access "black-box" data
sources via ODBC?

By black-box I mean a) unknown driver, b) unknown unicode support,
c) unknown required ansi character set.

I've developed a programming language for ETL processes which uses my
own db wrapper library. I've added native support for various rdbms,
but I also want generic odbc support. The end-user knows about the
supported SQL commands, but commands and data must be transported
through my library and I don't know which driver a user wants to load.

I hoped the Wide-API provides exactly what I need, but it seems
horrible after a deeper look.

The only "working" solution I could figure out is to implement both
the wide and ANSI APIs and have the user specify which one should be
used internally; for the ANSI API, a character set must also be
specified. For example, the configuration string for my library could
be something like

   "engine=odbc;odbcapi=ansi;charset=utf8;DSN=mysource"

or

   "engine=odbc;odbcapi=unicode;DSN=mysource"

So if the user really needs to load an ODBC 2.0 driver which only
returns iso-850 chars, he can configure my lib this way (extreme
example).
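
A minimal sketch of how such a configuration string could be parsed and
validated. The keys engine, odbcapi, charset and DSN come from the
examples above; the function name parse_config, the lowercasing of
keys, and the default of "unicode" when odbcapi is missing are
assumptions made only for this illustration.

```python
def parse_config(conn):
    """Split a 'key=value;key=value' string and check the odbcapi/charset pair."""
    opts = {}
    for part in conn.split(";"):
        if not part.strip():
            continue  # tolerate empty segments such as a trailing ';'
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("missing '=' in %r" % part)
        opts[key.strip().lower()] = value.strip()

    # Assumption: fall back to the wide API when nothing is specified.
    api = opts.setdefault("odbcapi", "unicode")
    if api not in ("ansi", "unicode"):
        raise ValueError("odbcapi must be 'ansi' or 'unicode'")
    if api == "ansi" and "charset" not in opts:
        # The ANSI path cannot guess the driver's encoding.
        raise ValueError("odbcapi=ansi requires a charset")
    return opts


print(parse_config("engine=odbc;odbcapi=ansi;charset=utf8;DSN=mysource"))
```

Rejecting odbcapi=ansi without a charset makes the user state the
encoding explicitly instead of relying on a default.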

I don't see any other way to provide working odbc support.

By the way, the issues discussed here also affect other existing
software. For example, the pyodbc module for python3 has the same
issues. If you access a SQLite db from python3 via unixODBC and insert
a python string into it, the driver stores invalid characters in the
db.
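
The mismatch can be reproduced without any ODBC involved. The sketch
below only simulates the two sides: it assumes the DM converts the wide
string to ISO-8859-1 (as discussed above) and that the driver
interprets the resulting bytes as UTF-8. The function simulate is
hypothetical and does not call unixODBC or any driver.

```python
def simulate(text, dm_encoding="latin-1", driver_encoding="utf-8"):
    """Return the bytes the DM would pass on and the text the driver would see."""
    raw = text.encode(dm_encoding)  # latin-1 is Python's name for ISO-8859-1
    # errors="replace" marks undecodable bytes with U+FFFD instead of raising.
    seen = raw.decode(driver_encoding, errors="replace")
    return raw, seen


raw, seen = simulate("K\u00e4se")
print(raw, seen == "K\u00e4se")
```

With both sides set to UTF-8 the text survives the round trip, which is
why the character set has to be agreed on explicitly.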


If anyone has a better solution for the issues I would be very happy :)



-- 
     Daniel Vogelbacher
     www.chaospixel.com
     cytrinox at freenode/ircnet/quakenet
