Notes |
|
(0001216)
|
Dave Higton
|
2016-02-10 20:07
(Last edited: 2016-02-10 20:10) |
|
I did an experiment. I found that the UTF-8 representation of lower case u umlaut is C3 BC. Going to the Chars application, I entered into NS's search box the characters with byte values C3 and BC. This allowed me to step through all the lower case u umlauts in the page; they were found.
It's not very useful. All the experiment shows is that the characters are stored internally by NS in UTF-8. They are translated from the source (e.g. ü) to UTF-8.
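The byte values in the experiment can be checked with a short Python sketch. This is an illustration only, not NetSurf code, and it assumes a Latin-1 style local alphabet (ISO 8859-1) for the characters typed into the search box.

```python
# Illustration only: this is not NetSurf code.
u_umlaut = "\u00fc"  # lower case u umlaut

# The same character in the two encodings discussed here.
utf8_bytes = u_umlaut.encode("utf-8")
latin1_bytes = u_umlaut.encode("iso-8859-1")  # assumed local alphabet

print(utf8_bytes.hex(" ").upper())    # C3 BC
print(latin1_bytes.hex(" ").upper())  # FC

# Typing the byte values C3 and BC as two separate local-alphabet
# characters gives a two-character string whose bytes equal the
# UTF-8 form of the u umlaut.
typed = bytes([0xC3, 0xBC]).decode("iso-8859-1")
print(len(typed))  # 2
```

This is consistent with the search matching when the raw bytes C3 BC are entered: the typed bytes happen to equal the stored UTF-8 sequence.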
|
|
|
(0001219)
|
Dave Higton
|
2016-02-11 22:32
(Last edited: 2016-02-11 22:33) |
|
I've made some changes to my local dev version that go some way towards the functionality. I've added a new function ro_gui_get_icon_string2() that optionally transcodes from the local alphabet to UTF-8. However, when it's called, there are a couple of problems:
1) The search string that is put back into the search window from the search history is the UTF-8 encoded version, which of course wants to get transcoded again... a sort of recursion :-) I can't (yet) see where the history is taken from, so I can't correct it.
2) In the cited page, searching for "Gestüt", it just happens that one occurrence breaks across a line, and therefore isn't found, or at least isn't highlighted when it is found - I'm not sure. It's another reminder of some very very basic flaws in the layout engine.
So I'm sure it can be done, but I'm not there yet.
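Problem (1) can be reproduced outside NetSurf. The sketch below is Python rather than NetSurf's C, assumes the local alphabet is ISO 8859-1, and uses a hypothetical helper `local_to_utf8`. It only shows what happens when an already transcoded string is transcoded a second time.

```python
LOCAL_ALPHABET = "iso-8859-1"  # assumption: Latin-1 style local alphabet


def local_to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: transcode local-alphabet bytes to UTF-8."""
    return raw.decode(LOCAL_ALPHABET).encode("utf-8")


typed = "Gest\u00fct".encode(LOCAL_ALPHABET)  # "Gestüt" as typed in the icon

once = local_to_utf8(typed)   # what the search should use
twice = local_to_utf8(once)   # what results if the UTF-8 form is fed back in

print(once)   # b'Gest\xc3\xbct'
print(twice)  # b'Gest\xc3\x83\xc2\xbct'
```

The second result no longer contains the C3 BC sequence as a unit, so it cannot match the page text.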
|
|
|
|
I've got a solution to (1) above: translate back from UTF-8 to the local alphabet when adding a search string to the search history.
However, as for point (2) above, it's worse than I thought. None of the instances of "Gestüt" in the main text block is found. I can't see why they should not be. I need a grown-up to tell me what the search function is expected to see and what it isn't. |
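The fix for point (1) amounts to keeping the history in the local alphabet so that every string read from it is transcoded exactly once. A minimal Python sketch of that idea follows; the class and function names are hypothetical, ISO 8859-1 is assumed as the local alphabet, and none of this is NetSurf's actual code.

```python
LOCAL_ALPHABET = "iso-8859-1"  # assumption: Latin-1 style local alphabet


class SearchHistory:
    """Hypothetical history that only ever stores local-alphabet bytes."""

    def __init__(self) -> None:
        self.entries: list[bytes] = []

    def add_utf8(self, utf8_text: bytes) -> None:
        # Translate back to the local alphabet before storing.
        self.entries.append(utf8_text.decode("utf-8").encode(LOCAL_ALPHABET))


def search_bytes_from_icon(icon_text: bytes) -> bytes:
    """Hypothetical: transcode icon text (local alphabet) to UTF-8."""
    return icon_text.decode(LOCAL_ALPHABET).encode("utf-8")


history = SearchHistory()
first_search = search_bytes_from_icon("Gest\u00fct".encode(LOCAL_ALPHABET))
history.add_utf8(first_search)

# Re-running the search from history yields the same UTF-8 bytes.
repeat_search = search_bytes_from_icon(history.entries[-1])
print(repeat_search == first_search)  # True
```

One caveat with this approach: a UTF-8 string containing characters that the local alphabet cannot represent would fail to convert back, so a real implementation would need a policy for that case.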
|
|
|
Belatedly I've looked in more detail at the HTML source of the cited page. The occurrences of u umlaut are coded two ways: the top block of text contains HTML entities ("ü"), the rest are ISO8859-1 characters.
My experiments above indicate that the ISO8859-1 characters are transcoded internally to UTF-8, so my fix of transcoding the search string to UTF-8 finds them. However, I haven't found a way to input anything that finds the HTML entities - even putting the HTML entity into the search text box doesn't find them - so I don't know how they are encoded and/or searched for internally to NS.
I don't have a method that works enough to make it worthwhile, so I'm not submitting a patch at this point. |
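For comparison, at the level of character data an entity and the raw ISO 8859-1 byte denote the same character. The Python sketch below uses the standard `html.unescape` and assumes the named form `&uuml;` (the page may use a numeric form instead). It says nothing about how NetSurf stores entities internally, which is the open question here.

```python
import html

entity_source = "Gest&uuml;t"   # assumed entity spelling in the HTML source
raw_source = b"Gest\xfct"       # the same word as ISO 8859-1 bytes

from_entity = html.unescape(entity_source)
from_raw = raw_source.decode("iso-8859-1")

# Both routes give the same text, and the same UTF-8 bytes.
print(from_entity == from_raw)                # True
print(from_entity.encode("utf-8"))            # b'Gest\xc3\xbct'

# The literal entity text is gone once decoded, so typing it into a
# search box would not be expected to match the decoded text.
print("&uuml;" in from_entity)                # False
```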
|
|
|
This turns out to be more of a can of worms than I thought. One basic problem is that HTML entities, such as ü, are each put into their own boxes, so the string that includes them does not appear anywhere contiguously and therefore cannot be searched for. The cited page includes some characters that are encoded as HTML entities and some that are UTF-8 already and therefore can be searched for.
The more knowledgeable developers say that this cannot be fixed until we replace the layout engine. |
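The effect of splitting text across boxes can be shown with a toy model. In the Python sketch below a list of strings stands in for a run of boxes; this is a deliberate simplification, not NetSurf's box tree, and `find_across_boxes` is a hypothetical helper rather than a proposed fix.

```python
# Toy model: each list item stands in for the text of one box.
boxes = ["Das ", "Gest", "\u00fc", "t", " liegt"]  # the entity got its own box
needle = "Gest\u00fct"

# Searching box by box never sees the whole word.
found_per_box = any(needle in text for text in boxes)
print(found_per_box)  # False


def find_across_boxes(boxes, needle):
    """Return (first_box, last_box) covering the first match, or None."""
    joined = "".join(boxes)
    start = joined.find(needle)
    if start < 0:
        return None
    end = start + len(needle) - 1  # index of the last matched character
    pos = 0
    first = last = None
    for i, text in enumerate(boxes):
        box_end = pos + len(text) - 1
        if text:
            if first is None and start <= box_end:
                first = i
            if end <= box_end:
                last = i
                break
        pos += len(text)
    return first, last


# Searching the concatenated text finds it and reports the boxes spanned.
print(find_across_boxes(boxes, needle))  # (1, 3)
```

A match that spans several boxes also has to be highlighted across them, which is part of why this is tied to the layout engine rather than to the search code alone.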
|