Basic4ppc - Windows Mobile Development  

Character encoding / code pages


  #1 (permalink)  
Old 05-27-2008, 09:28 PM
Knows the basics
 
Join Date: Apr 2008
Location: Duesseldorf, Germany
Posts: 71

Hello,

I've got a question regarding different / localized character encodings.

In the app I'm working on I receive a certain text from an internet server. I use two different transport methods to get this text: one is HTTP and the other is POP3 (mail). The text is the same regardless of which transport method I choose, and it always uses the same character encoding.

The text is then displayed in a textbox control.

The text contains non-ASCII characters (German umlauts, for example). For the HTTP method I use the code page variant of the WebResponse object:

Code:
Response.New2(1252)

which is working fine and as expected (Response.New1 leaves me with unprintable characters).

The text is in the ISO 8859-1 encoding, which is an 8-bit extension of 7-bit ASCII - see ISO/IEC 8859 - Wikipedia, the free encyclopedia.

Code page 1252, which I use for the response, is pretty much the same as ISO 8859-1 except for the range 0x80-0x9F (control characters in ISO 8859-1), which does not matter in this case - see Windows-1252 - Wikipedia, the free encyclopedia.
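The relationship between the two encodings can be checked quickly. The following is a minimal sketch in Python (not Basic4ppc), using only the standard codecs; the variable names are made up for illustration.

```python
# Byte 0xF6 is the umlaut "ö" in both ISO 8859-1 and Windows-1252.
raw = bytes([0xF6])
as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# The two encodings only disagree in the range 0x80-0x9F.
# 0x80 is the euro sign in Windows-1252 but a C1 control character in ISO 8859-1.
euro_cp1252 = bytes([0x80]).decode("cp1252")
ctrl_latin1 = bytes([0x80]).decode("iso-8859-1")

# ascii() keeps the output printable on any console.
print(ascii(as_latin1), ascii(as_cp1252), ascii(euro_cp1252), ascii(ctrl_latin1))
```

For text that only contains letters, digits and umlauts, decoding with either encoding therefore gives the same result.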


If I receive the same text from the POP3 mail server, I end up with unprintable characters (squares). Using a network sniffer I can see that the text is encoded in exactly the same way as when it is received through HTTP.

So let's say the text contains a German "ö", which has the hex code F6 in ISO 8859-1. Due to the lack of any code page handling in B4PPC (except for a few instructions such as the WebResponse.New2 mentioned above), my plan was just to substitute the ISO codes for umlauts using a

Quote:
text = StrReplace(text,Chr(246),"ö")

but this does not work, probably due to the fact that Chr() does not know about Code Page 1252 or ISO8859-1.
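One plausible explanation, sketched here in Python rather than Basic4ppc, is that the damage happens before the replace runs. The sketch assumes the incoming bytes were decoded as UTF-8 (an assumption, the thread does not confirm which decoder was active). In that case the byte F6 never becomes the character U+00F6, so there is nothing for the replace to find. Note also that character 246 already is "ö", so the replace would be a no-op even on correctly decoded text.

```python
raw = b"Sch\xf6n"  # "Schön" as ISO 8859-1 bytes, 0xF6 is the "ö"

# Assumption: the receiving side decodes as UTF-8. 0xF6 is not valid UTF-8,
# so it becomes U+FFFD (the replacement character), not U+00F6.
wrong = raw.decode("utf-8", errors="replace")

# Replacing chr(246) changes nothing, because U+00F6 is not in the string.
still_wrong = wrong.replace(chr(246), "\u00f6")

# Decoding with the matching code page yields the umlaut directly.
right = raw.decode("cp1252")

print(ascii(wrong), ascii(still_wrong), ascii(right))
```

The fix therefore belongs at the point where bytes become text, not in a later string replacement.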

Questions regarding that matter:

- How does Basic4ppc handle different code pages?

- Does it handle them at all, or does it rely completely on UTF-8 encoding (Chr() does not appear to be able to cope with UTF-8), or on ASCII encoding?

- What about MIME/quoted-printable encoding?

- How can I solve the problem outlined above? Manual character conversion is relatively complex and time-consuming.


Kind regards

TWELVE
  #2 (permalink)  
Old 05-27-2008, 09:54 PM
Knows the basics
 
Join Date: Apr 2008
Location: Duesseldorf, Germany
Posts: 71

In the meantime I found a solution for my particular problem:

Since I use the Network library to communicate with the POP3 server and a Bitwise object to convert between strings and binary bytes, the following works, similar to the solution I use for the HTTP response:

Code:
bit.New2(1252)

Although this is working OK for me now, I would still like answers to my questions above...

cheers

TWELVE
  #3 (permalink)  
Old 05-28-2008, 10:08 AM
agraham's Avatar
Basic4ppc Expert
 
Join Date: Jul 2007
Location: Cheshire, UK
Posts: 1,770

Quote:
Originally Posted by TWELVE View Post
how does Basic4PPC handle different code pages ?
It doesn't and has no need to! .NET uses UTF-16 (2 byte characters) encoding internally so there is no need for code pages. Wide characters are used for ease of string manipulation and indexing so that each character is a known fixed size.

Quote:
does it at all or does it completely rely on UTF-8 encoding ?
.NET streams normally convert from UTF-16 to UTF-8 and vice versa on input and output. They can also convert to and from a non UTF-8 single byte character stream. To do so they use an Encoding object associated with a code page. BinaryFile.New2 lets you initialise the Encoding object associated with the stream to the codepage you require.
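The idea of a byte stream with an attached encoding can be sketched in Python (not Basic4ppc or .NET). Here `io.TextIOWrapper` plays the role that the Encoding object plays for a .NET stream; the sample bytes are made up for illustration.

```python
import io

# "Grüße" followed by CRLF, encoded as code page 1252 (one byte per character).
raw_stream = io.BytesIO(b"Gr\xfc\xdfe\r\n")

# The wrapper converts bytes to text on the way in, using the given code page.
# newline="" switches off line-ending translation so the CRLF is kept as-is.
reader = io.TextIOWrapper(raw_stream, encoding="cp1252", newline="")
text = reader.read()

print(ascii(text))
```

Once the text is inside the program it is ordinary Unicode text, and the original code page no longer matters.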

Quote:
( Chr() does not appear to be able to cope with UTF-8 ???) or on ASCII encoding ?
Chr() is UTF-16 based and only knows about wide characters. By the time characters are inside B4PPC they are UTF-16 characters.
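As an illustration in Python (whose `chr` takes a Unicode code point, comparable in this range to the Chr() behaviour described here), the number 246 maps to "ö" as a character, independent of any code page. The two-byte form only appears when the character is written out as UTF-16.

```python
ch = chr(246)                          # U+00F6, the character "ö"
utf16_bytes = ch.encode("utf-16-le")   # two bytes: 0xF6 0x00

print(ascii(ch), utf16_bytes)
```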

Quote:
what about MIME/quoted-printable encoding ?
They will be treated as any other character stream.

Quote:
how can i solve my problem outlined above ?
I'm not sure that I completely understand your problem, but any problems are caused at the interface from .NET to the OS or the outside world. Http should be UTF-8 based, which is why the character coding works correctly. Your details on the POP3 stream seem contradictory. From what you say it sounds like the POP3 stream is the same as the Http stream and so is also UTF-8, which would mean that umlauts are encoded as two bytes. But you also say that bit.New2(1252) works, which implies that the stream is actually single byte characters coded to code page 1252. Bit.New2() is the correct solution in this sort of case, where you are dealing with a single byte "code paged" character stream.
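The difference between the two candidate encodings is visible in the raw bytes, which is what a network sniffer shows. A minimal Python sketch follows; `looks_like_utf8` is a hypothetical helper, and a strict UTF-8 decode is only a heuristic, not proof of the encoding.

```python
umlaut = "\u00f6"  # "ö"

utf8 = umlaut.encode("utf-8")      # two bytes: 0xC3 0xB6
cp1252 = umlaut.encode("cp1252")   # one byte: 0xF6

def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (heuristic only)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(utf8, cp1252, looks_like_utf8(b"Sch\xf6n"))
```

A lone F6 byte in the capture, as described in the first post, points to a single byte code page rather than UTF-8.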

Last edited by agraham : 05-28-2008 at 11:31 AM.
  #4 (permalink)  
Old 05-28-2008, 10:35 AM
klaus's Avatar
Basic4ppc Expert
 
Join Date: Oct 2007
Location: Switzerland
Posts: 707

Hi TWELVE,

JamesC had a similar problem with German characters coded in a single byte.
What happened to the ß?

Erel's solution was the same: use the binary file object with bin.New2(c, code page number).

Hi Erel,
The link in the help file for the Code Page numbers doesn't work anymore, it says Contend not found.

Best regards.
__________________
Klaus
Switzerland

Last edited by klaus : 05-28-2008 at 10:38 AM.
Reply With Quote
  #5 (permalink)  
Old 05-28-2008, 11:03 AM
Erel's Avatar
Administrator
 
Join Date: Apr 2007
Posts: 3,199

Code page link: Code Page Identifiers
It was updated in version 6.30.
  #6 (permalink)  
Old 05-28-2008, 11:18 AM
klaus's Avatar
Basic4ppc Expert
 
Join Date: Oct 2007
Location: Switzerland
Posts: 707

It's strange, because when I start the help file from the 6.30 IDE and click on the link I get the message Contend not found.

Best regards.
__________________
Klaus
Switzerland
  #7 (permalink)  
Old 05-28-2008, 11:19 AM
Erel's Avatar
Administrator
 
Join Date: Apr 2007
Posts: 3,199

On which topic?
  #8 (permalink)  
Old 05-28-2008, 01:06 PM
klaus's Avatar
Basic4ppc Expert
 
Join Date: Oct 2007
Location: Switzerland
Posts: 707

Binary file New2

Attached a screenshot.

Best regards
Attached Images
File Type: jpg New2.jpg (65.7 KB, 17 views)
__________________
Klaus
Switzerland
  #9 (permalink)  
Old 05-29-2008, 12:59 AM
Knows the basics
 
Join Date: Apr 2008
Location: Duesseldorf, Germany
Posts: 71

@agraham:

Quote:
Chr() is UTF-16 based and only knows about wide characters. By the time characters are inside B4PPC they are UTF-16 characters

Maybe internally, but the Chr() help talks about ASCII and a value range of 0 to 255:

Quote:
Returns the ASCII character represented by the given number.
Syntax: Chr (Integer)
Integer ranges from 0 to 255.
Quote:
what about MIME/quoted-printable encoding ?

They will be treated as any other character stream.
What does this mean? :-) If the compiler and/or the OS deals with UTF internally, some conversion might be needed if a character stream comes in with an encoding other than UTF.
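For quoted-printable in particular there are two separate steps: first undo the transfer encoding to get the raw bytes back, then decode those bytes with the charset named in the mail header. A minimal sketch in Python (not Basic4ppc), using the standard `quopri` module and a made-up sample line:

```python
import quopri

# A mail body line in quoted-printable: "=F6" stands for the single byte 0xF6.
encoded = b"Sch=F6ne Gr=FC=DFe"

# Step 1: undo the transfer encoding, which yields raw bytes.
raw = quopri.decodestring(encoded)

# Step 2: decode the bytes with the charset declared for the message
# (assumed here to be ISO 8859-1, as in this thread).
text = raw.decode("iso-8859-1")

print(raw, ascii(text))
```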


Quote:
Http should be UTF-8 based which is why the character coding works correctly.
That's not true. An HTTP stream can use UTF-8, but this is not obligatory. The encoding used is determined by server and client and can be read from an HTTP header.

Quote:
Your details on the POP3 stream seem contradictory. From what you say it sounds like the POP3 stream is the same as the Http stream and so is also UTF-8 which would mean that unmlauts are encoded as two bytes.
I'm afraid this is called a wrong assumption...

Both streams are in the same encoding, which is ISO 8859-1 and NOT UTF-8. So it's clear that a conversion has to take place no matter which transport is used.
Because I cannot guess which encoding a text is in, I need some hint, and this is usually contained in a header.


Quote:
But you also say that bit.New2(1252) works which implies that the stream is actually single byte characters coded to code page 1252.
That's absolutely true.

So for me the conclusions from this are as follows:


- the programmer does not need to care about character encodings as long as everything is kept in UTF

- strings in Basic4ppc are in UTF

- if a (foreign) character stream from outside enters a Basic4ppc variable, a conversion needs to take place if the stream is not in UTF

- the conversion can only be done properly if the stream's code page is known and a conversion function supporting that code page is available

- if no code page is specified, Basic4ppc seems to interpret the non-UTF stream as ASCII (this is why I could read most of the text, but the umlauts were replaced with squares), which equals the lower 7 bits of any ISO 8859 charset

For an HTTP stream this can be achieved easily by interpreting the Content-Type header, which contains the charset used. The (ISO) charset name still needs to be converted to a code page number, though.
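The mapping from a header charset to a code page number could look roughly like the sketch below. It is written in Python, not Basic4ppc; `CODE_PAGES` and `code_page_for` are hypothetical names, the table holds only a handful of Windows code page identifiers, and the fallback to 1252 is an assumption chosen to match this thread.

```python
from email.message import Message

# A few charset names and their Windows code page identifiers.
CODE_PAGES = {
    "us-ascii": 20127,
    "iso-8859-1": 28591,
    "iso-8859-15": 28605,
    "windows-1252": 1252,
    "utf-8": 65001,
}

def code_page_for(content_type: str, default: int = 1252) -> int:
    """Extract the charset from a Content-Type value and map it to a code page."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased name, or None if absent
    return CODE_PAGES.get(charset, default)

print(code_page_for("text/html; charset=ISO-8859-1"))
```

The same lookup applies to the Content-Type header of a mail message fetched over POP3, since mail headers use the same syntax.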

cheers

TWELVE
  #10 (permalink)  
Old 05-29-2008, 09:41 AM
agraham's Avatar
Basic4ppc Expert
 
Join Date: Jul 2007
Location: Cheshire, UK
Posts: 1,770

Quote:
Originally Posted by TWELVE View Post
Maybe internal, but Chr() help is talking about ASCII and a value range of 0 to 255:
It's wrong. Try this "For i = 1024 To 1124 :msg = msg & Chr(i) : Next : msgbox(msg)"
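The same experiment, transcribed to Python as a rough equivalent (Python's `chr` is assumed to be comparable to Chr() for this range), builds a string from the values 1024 to 1124. These all lie in the Cyrillic block, well above 255.

```python
# Basic's "For i = 1024 To 1124" includes both ends, hence range(1024, 1125).
msg = "".join(chr(i) for i in range(1024, 1125))

# ascii() keeps the output printable on any console.
print(len(msg), ascii(msg[:3]))
```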
Quote:
What does this mean..? :-) If the Compiler and or the OS is dealing with UTF internally, some conversion might be needed if a character/stream is coming in with an encoding different from UTF.
It might but as I tried to explain any conversion is done by the stream at the boundary of the .NET world and you need to specify the conversion necessary.
Quote:
That's not true ... encoding is determined by server and client and can be read from a http header.
Right. Due to my utter lack of interest, and hence utter lack of knowledge, in all things Webby I made a false assumption. I now understand how the Http stream converted the characters properly without it being UTF-8.
Quote:
the programmer does not need to take care about character encodings as long as everything is kept in UTF
Correct.
Quote:
strings in basic4ppc are in UTF
Correct, held in UTF-16 format each character occupying two bytes.
Quote:
if a (foreign) character/stream from outside enters a basic4ppc variable, a conversion needs to take place, if the stream is not in UTF.
Correct, achieved by attaching a .NET Encoding object to the stream and specifying to that object the conversion to be made.
Quote:
the conversion can only be done properly, if the stream's code page is known and a conversion function supporting a code page is available
Correct
Quote:
if no code page is specified, basic4ppc seems to interpret the non-UTF stream as ASCII ( this is why i could read most of the text, but the umlauts were replaced with the squares), which equals to the lower 7 Bit of any ISO8859 charset.
To be pedantic (again), Basic4ppc doesn't interpret anything; it receives UTF-16 from a stream. How the encoding is treated depends on the stream. How are you getting this ASCII default? I assume you are using a BinaryFile object as the stream, which if opened by New1 gives you the choice of ASCII or UTF-8, or if opened by New2 requires a code page to be specified. I see no default behaviour.
Quote:
For a http stream this can be achieved easily by interpreting the content-type header, which contains the used charset.But the (ISO-)Charset number needs to be converted to a code page, though.
From your experience it looks like the WebResponse object in the Http library takes care of this as it is part of the Http protocol - hence my false assumption of UTF-8. The Network library, not knowing about higher level protocols doesn't and just provides a byte stream which, as you say, may need conversion.
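When a raw byte stream arrives in pieces, as it does from a socket, the conversion has to cope with chunk boundaries. The following Python sketch (not Basic4ppc) uses an incremental decoder for this; `decode_chunks` is a hypothetical helper and the chunks are made-up sample data.

```python
import codecs

def decode_chunks(chunks, encoding):
    """Decode a sequence of byte chunks as one continuous character stream."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and raises if a character is left incomplete.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

# Single byte code page: chunk boundaries can never split a character.
single_byte = decode_chunks([b"Sch\xf6", b"n"], "cp1252")

# UTF-8: the two bytes of "ö" are split across chunks and still decode correctly.
multi_byte = decode_chunks([b"Sch\xc3", b"\xb6n"], "utf-8")

print(ascii(single_byte), ascii(multi_byte))
```

With a single byte code page such as 1252 each chunk could also be decoded on its own, which may be why the simple bit.New2(1252) approach is enough in this case.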

EDIT :- I'm wrong again about Webby stuff and the WebResponse handling things - I just saw your "Response.New2(1252)" in the first post. I suppose you need to New The WebRequest with the required code page and use the same codepage for Newing the WebResponse!

Last edited by agraham : 05-29-2008 at 03:08 PM.