MantisBT - NetSurf
View Issue Details
ID: 0002320
Project: NetSurf
Category: RISC OS-specific
View Status: public
Date Submitted: 2015-05-29 19:04
Last Update: 2016-02-28 22:24
Reporter: Harriet Bazley
Assigned To:
Priority: normal
Severity: minor
Reproducibility: always
Status: acknowledged
Resolution: open
Platform: ARM
OS: RISC OS
OS Version: 5.19
Product Version: 3.4
Target Version:
Fixed in Version:
Fixed in CI build #:
Reported in CI build #:
URL of problem page:
Summary: 0002320: Cannot search for accented characters
Description: The F4 search feature fails to locate any search string containing accented characters, e.g. "Gestüt" or "idée", even when such a string is displayed on the page - which severely limits the utility of this feature in a non-English context. It seems to apply both to cases where the character is represented in the source code as an HTML entity and where it is UTF-8-encoded.
Steps To Reproduce: Open http://www.gestuet-tannenhof.com/ for example.
Press F4 and type "Gestüt"; NetSurf reports "Not found".
Additional Information: I have a nasty feeling that this one involves some very complicated internal workings of NetSurf and isn't readily fixable, if at all....
Tags: No tags attached.
Attached Files

Notes
(0001216)
Dave Higton   
2016-02-10 20:07   
(Last edited: 2016-02-10 20:10)
I did an experiment. I found that the UTF-8 representation of lower case u umlaut is C3 BC. Going to the Chars application, I entered into NS's search box the characters with byte values C3 and BC. This allowed me to step through all the lower case u umlauts in the page; they were found.

It's not very useful. All the experiment shows is that the characters are stored internally by NS in UTF-8. They are translated from the source (e.g. ü) to UTF-8.
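
The byte values from this experiment can be checked outside NetSurf. The following is only an illustrative sketch in Python, assuming the local alphabet is ISO 8859-1 (Latin-1); it shows why a search string typed as a single local byte cannot match text held as UTF-8.

```python
# The same character in two encodings: one byte in ISO 8859-1, two in UTF-8.
u_umlaut = "\u00fc"  # lower case u umlaut

latin1_bytes = u_umlaut.encode("iso-8859-1")
utf8_bytes = u_umlaut.encode("utf-8")

print(latin1_bytes.hex(" ").upper())  # FC
print(utf8_bytes.hex(" ").upper())    # C3 BC

# Text held as UTF-8 contains C3 BC, so a byte-for-byte search for the
# single Latin-1 byte FC cannot match it.
page = "Gest\u00fct Tannenhof".encode("utf-8")
print(latin1_bytes in page)  # False
print(utf8_bytes in page)    # True
```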

(0001219)
Dave Higton   
2016-02-11 22:32   
(Last edited: 2016-02-11 22:33)
I've made some changes to my local dev version that go some way towards the functionality. I've added a new function ro_gui_get_icon_string2() that optionally transcodes from local to utf8 alphabet. However, when it's called, there are a couple of problems:

1) The search string that is put back into the search window from the search history is the utf8 encoded version, which of course wants to get transcoded again... a sort of recursion :-) I can't (yet) see where the history is taken from, so I can't correct it.

2) In the cited page, searching for "Gestüt", it just happens that one occurrence breaks across a line, and therefore isn't found, or at least isn't highlighted when it is found - I'm not sure. It's another reminder of some very very basic flaws in the layout engine.

So I'm sure it can be done, but I'm not there yet.
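
Problem (1) is the classic double-transcoding effect. A minimal sketch in Python follows, assuming the local alphabet is ISO 8859-1; `local_to_utf8` is a hypothetical stand-in for the transcoding step, not the actual `ro_gui_get_icon_string2()` code.

```python
def local_to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: read raw as Latin-1 text and re-encode as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")

typed = "Gest\u00fct".encode("iso-8859-1")  # bytes as typed: b'Gest\xfct'
once = local_to_utf8(typed)                 # correct UTF-8 search string
twice = local_to_utf8(once)                 # history entry transcoded again

print(once)   # b'Gest\xc3\xbct'
print(twice)  # b'Gest\xc3\x83\xc2\xbct'
# Decoded as UTF-8, the second result reads "GestÃ¼t", which no longer
# matches anything on the page.
```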

(0001220)
Dave Higton   
2016-02-12 22:27   
I've got a solution to (1) above: translate back from UTF8 to local when adding a search string to the search history.
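
The round trip described here can be sketched as follows. This is an illustration only: `SearchHistory` and `query_from_icon` are hypothetical names, and the local alphabet is assumed to be ISO 8859-1.

```python
LOCAL = "iso-8859-1"  # assumed local alphabet

class SearchHistory:
    """Hypothetical history that keeps entries in the local alphabet."""

    def __init__(self):
        self.entries = []

    def add(self, utf8_query):
        # Translate back from UTF-8 to local before storing the entry.
        self.entries.append(utf8_query.decode("utf-8").encode(LOCAL))

def query_from_icon(raw_local):
    """Transcode the text box contents from local to UTF-8 for searching."""
    return raw_local.decode(LOCAL).encode("utf-8")

history = SearchHistory()
query = query_from_icon("Gest\u00fct".encode(LOCAL))
history.add(query)

# A recalled entry is local bytes again, so the next search transcodes it
# exactly once and produces the same UTF-8 query.
recalled = history.entries[-1]
print(query_from_icon(recalled) == query)  # True
```

A query containing characters that the local alphabet cannot represent would make the `encode(LOCAL)` step raise an error in this sketch, so a real implementation would need a fallback for that case.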

However, as for point (2) above, it's worse than I thought. None of the instances of "Gestüt" in the main text block is found. I can't see why they should not be. I need a grown-up to tell me what the search function is expected to see and what it isn't.

(0001290)
Dave Higton   
2016-02-16 15:20   
Belatedly I've looked in more detail at the HTML source of the cited page. The occurrences of u umlaut are coded two ways: the top block of text contains HTML entities ("&uuml;"), the rest are ISO8859-1 characters.

My experiments above indicate that the ISO8859-1 characters are transcoded internally to UTF-8, so my fix of transcoding the search string to UTF-8 finds them. However, I haven't found a way to input anything that finds the HTML entities - even putting the HTML entity into the search text box doesn't find them - so I don't know how they are encoded and/or searched for internally to NS.
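
At the character level the two spellings are equivalent once decoded. The sketch below uses Python's standard library rather than NetSurf's parser, so it illustrates only that equivalence and says nothing about how NetSurf stores the text internally.

```python
import html

# Two ways a page source can spell the same word.
as_entity = b"Gest&uuml;t"  # HTML entity, plain ASCII bytes
as_latin1 = b"Gest\xfct"    # raw ISO 8859-1 byte FC

text_from_entity = html.unescape(as_entity.decode("ascii"))
text_from_latin1 = as_latin1.decode("iso-8859-1")

print(text_from_entity == text_from_latin1)  # True
print(text_from_entity.encode("utf-8"))      # b'Gest\xc3\xbct'
```

Typing the entity itself into the search box cannot match, because after decoding the text no longer contains the characters `&uuml;`.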

I don't have a method that works enough to make it worthwhile, so I'm not submitting a patch at this point.

(0001340)
Dave Higton   
2016-02-28 22:24   
This turns out to be more of a can of worms than I thought. One basic problem is that HTML entities, such as &uuml;, are each put into their own boxes, so the string that includes them does not appear anywhere contiguously and therefore cannot be searched for. The cited page includes some characters that are encoded as HTML entities and some that are UTF-8 already and therefore can be searched for.
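
The effect of the split can be sketched with a toy model. This is not NetSurf's actual data structure: the lists below are a hypothetical stand-in for text boxes, assuming a search that looks inside one box at a time.

```python
# Hypothetical box model: each list entry is the text of one layout box.
# Assumption for illustration: an entity ends up in a box of its own.
boxes_entity = ["Gest", "\u00fc", "t Tannenhof"]  # source had Gest&uuml;t
boxes_plain = ["Gest\u00fct Tannenhof"]           # source had the character itself

def search_per_box(boxes, needle):
    """Return the indices of boxes whose own text contains the needle."""
    return [i for i, text in enumerate(boxes) if needle in text]

needle = "Gest\u00fct"
print(search_per_box(boxes_plain, needle))   # [0]
print(search_per_box(boxes_entity, needle))  # []

# The concatenated text does contain the needle, but the match spans
# several boxes, which a per-box search cannot report.
print(needle in "".join(boxes_entity))       # True
```

The same model is consistent with the earlier experiment: a search for the lone character succeeds, because a single character fits inside one box.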

The more knowledgeable developers say that this cannot be fixed until we replace the layout engine.

Issue History
2015-05-29 19:04  Harriet Bazley   New Issue
2015-06-01 14:48  Vincent Sanders  Status: new => acknowledged
2015-06-01 14:48  Vincent Sanders  Steps to Reproduce Updated
2015-06-01 14:48  Vincent Sanders  Product Version: => 3.4
2016-02-10 20:07  Dave Higton      Note Added: 0001216
2016-02-10 20:09  Dave Higton      Note Edited: 0001216
2016-02-10 20:10  Dave Higton      Note Edited: 0001216
2016-02-11 22:32  Dave Higton      Note Added: 0001219
2016-02-11 22:32  Dave Higton      Note Edited: 0001219
2016-02-11 22:33  Dave Higton      Note Edited: 0001219
2016-02-12 22:27  Dave Higton      Note Added: 0001220
2016-02-12 22:28  Dave Higton      Assigned To: => Dave Higton
2016-02-12 22:28  Dave Higton      Steps to Reproduce Updated
2016-02-12 22:29  Dave Higton      Status: acknowledged => assigned
2016-02-16 15:20  Dave Higton      Note Added: 0001290
2016-02-28 22:24  Dave Higton      Note Added: 0001340
2016-02-28 22:24  Dave Higton      Assigned To: Dave Higton =>
2016-02-28 22:24  Dave Higton      Status: assigned => acknowledged