2019-12-07 00:16 UTC

ID: 0002320
Project: NetSurf
Category: RISC OS-specific
View Status: public
Last Update: 2016-02-28 22:24
Reporter: Harriet Bazley
Assigned To:
Severity: minor
Reproducibility: always
Status: acknowledged
Resolution: open
Platform: ARM
OS: RISC OS
OS Version: 5.19
Product Version: 3.4
Target Version:
Fixed in Version:
Summary: 0002320: Cannot search for accented characters
Description: The F4 search feature fails to locate any search string containing accented characters, e.g. "Gestüt" or "idée", even when such a string is displayed on the page - which severely limits the utility of this feature in a non-English context. It seems to apply both to cases where the character is represented in the source code as an HTML entity and where it is UTF-8-encoded.
Steps To Reproduce: Open http://www.gestuet-tannenhof.com/ for example.
Press F4 and type "Gestüt"; Netsurf reports "Not found"
Additional Information: I have a nasty feeling that this one involves some very complicated internal workings of Netsurf and isn't readily fixable, if at all....
Tags: No tags attached.

Notes

~0001216

Dave Higton (developer)

Last edited: 2016-02-10 20:10

I did an experiment. I found that the UTF-8 representation of lower case u umlaut is C3 BC. Going to the Chars application, I entered into NS's search box the characters with byte values C3 and BC. This allowed me to step through all the lower case u umlauts in the page; they were found.

It's not very useful. All the experiment shows is that the characters are stored internally by NS in UTF-8. They are translated from the source (e.g. &uuml;) to UTF-8.
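
The experiment can be reproduced outside NetSurf. The sketch below is only an illustration: it assumes the RISC OS local alphabet is Latin1 (ISO 8859-1), which the note does not state, and the variable names and sample page text are made up.

```python
# "ü" is U+00FC; its UTF-8 encoding is the two bytes C3 BC.
utf8_bytes = "ü".encode("utf-8")
print(utf8_bytes.hex(" ").upper())  # C3 BC

# Entering the characters with byte values C3 and BC in a Latin-1
# alphabet shows two unrelated glyphs, but puts those same two bytes
# into the search buffer.
typed = bytes([0xC3, 0xBC]).decode("latin-1")

# A page held internally as UTF-8 (assumed), searched byte for byte.
page = "Gestüt Tannenhof".encode("utf-8")
found_at = page.find(typed.encode("latin-1"))
print(found_at)
```

The match succeeds only because the untranscoded search bytes happen to equal the UTF-8 form of the character, which is consistent with the note's conclusion about internal storage.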

~0001219

Dave Higton (developer)

Last edited: 2016-02-11 22:33

I've made some changes to my local dev version that go some way towards the functionality. I've added a new function ro_gui_get_icon_string2() that optionally transcodes from local to utf8 alphabet. However, when it's called, there are a couple of problems:

1) The search string that is put back into the search window from the search history is the utf8 encoded version, which of course wants to get transcoded again... a sort of recursion :-) I can't (yet) see where the history is taken from, so I can't correct it.

2) In the cited page, searching for "Gestüt", it just happens that one occurrence breaks across a line, and therefore isn't found, or at least isn't highlighted when it is found - I'm not sure. It's another reminder of some very very basic flaws in the layout engine.

So I'm sure it can be done, but I'm not there yet.
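
Problem (1) is the familiar double-transcoding effect. A minimal sketch, assuming the local alphabet is Latin1; `local_to_utf8` is a hypothetical stand-in and not the real ro_gui_get_icon_string2():

```python
LOCAL = "latin-1"  # assumed local alphabet; not stated in the note


def local_to_utf8(raw: bytes) -> bytes:
    """Transcode text-field bytes from the local alphabet to UTF-8."""
    return raw.decode(LOCAL).encode("utf-8")


field = "Gestüt".encode(LOCAL)  # bytes as typed into the search box
once = local_to_utf8(field)     # the intended UTF-8 search key
twice = local_to_utf8(once)     # a UTF-8 history entry transcoded again

print(once)
print(twice)
```

The second pass treats each UTF-8 byte as a separate local character, so the key grows and no longer matches the page text.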

~0001220

Dave Higton (developer)

I've got a solution to (1) above: translate back from UTF8 to local when adding a search string to the search history.
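
The round trip described here can be sketched as follows. This is an illustration under the assumption that the local alphabet is Latin1; the function names and the `history` list are hypothetical, not NetSurf's.

```python
LOCAL = "latin-1"  # assumed local alphabet

history = []  # hypothetical search history, kept in the local alphabet


def local_to_utf8(raw: bytes) -> bytes:
    return raw.decode(LOCAL).encode("utf-8")


def utf8_to_local(raw: bytes) -> bytes:
    return raw.decode("utf-8").encode(LOCAL)


def search(field_bytes: bytes) -> bytes:
    """Build the UTF-8 search key and record the local form in history."""
    key = local_to_utf8(field_bytes)
    history.append(utf8_to_local(key))
    return key


first = search("Gestüt".encode(LOCAL))
second = search(history[-1])  # repeat the search from the history entry
print(first == second)
```

One caveat of this sketch: `utf8_to_local` raises UnicodeEncodeError for any character that has no Latin1 equivalent, so a real implementation would need a policy for such strings.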

However, as for point (2) above, it's worse than I thought. None of the instances of "Gestüt" in the main text block is found. I can't see why they should not be. I need a grown-up to tell me what the search function is expected to see and what it isn't.

~0001290

Dave Higton (developer)

Belatedly I've looked in more detail at the HTML source of the cited page. The occurrences of u umlaut are coded two ways: the top block of text contains HTML entities ("&uuml;"), the rest are ISO8859-1 characters.

My experiments above indicate that the ISO8859-1 characters are transcoded internally to UTF-8, so my fix of transcoding the search string to UTF-8 finds them. However, I haven't found a way to input anything that finds the HTML entities - even putting the HTML entity into the search text box doesn't find them - so I don't know how they are encoded and/or searched for internally to NS.

I don't have a method that works enough to make it worthwhile, so I'm not submitting a patch at this point.
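
For reference, both source forms denote the same character, U+00FC, so after parsing they would be expected to yield identical UTF-8 text. The sketch below assumes the entity in question is the named entity &uuml; and uses Python's html.unescape as a stand-in for an HTML parser; it says nothing about how NS stores the result internally.

```python
import html

entity_source = b"Gest&uuml;t"  # entity form: pure ASCII in the file
latin1_source = b"Gest\xfct"    # raw ISO 8859-1 byte FC in the file

from_entity = html.unescape(entity_source.decode("ascii"))
from_latin1 = latin1_source.decode("iso-8859-1")

print(from_entity == from_latin1)
print(from_entity.encode("utf-8"))
```

Since the decoded text is the same either way, the difference in search behaviour has to come from how the text is held after parsing rather than from the character encoding itself.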

~0001340

Dave Higton (developer)

This turns out to be more of a can of worms than I thought. One basic problem is that HTML entities, such as &uuml;, are each put into their own boxes, so the string that includes them does not appear anywhere contiguously and therefore cannot be searched for. The cited page includes some characters that are encoded as HTML entities and some that are UTF-8 already and therefore can be searched for.

The more knowledgeable developers say that this cannot be fixed until we replace the layout engine.
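
A toy model of the problem, with a plain list of strings standing in for layout boxes. NetSurf's real box structures are different, and the function names are hypothetical; the sketch only shows why a per-box search misses a string that is split across boxes, and is not a proposed fix.

```python
def search_per_box(boxes, needle):
    """Return indices of boxes whose own text contains the needle."""
    return [i for i, text in enumerate(boxes) if needle in text]


def search_joined(boxes, needle):
    """Search the concatenated text; return the boxes the match spans."""
    joined = "".join(boxes)
    start = joined.find(needle)
    if start < 0:
        return []
    end = start + len(needle)
    hits, pos = [], 0
    for i, text in enumerate(boxes):
        # A box is part of the match if its text range overlaps [start, end).
        if pos < end and pos + len(text) > start:
            hits.append(i)
        pos += len(text)
    return hits


# The entity ends up in a box of its own, splitting the word in three.
boxes = ["Das Gest", "ü", "t Tannenhof"]

per_box = search_per_box(boxes, "Gestüt")
joined = search_joined(boxes, "Gestüt")
print(per_box, joined)
```

Searching the joined text finds the word, but highlighting it then means mapping the match back onto several boxes, which is the part that depends on the layout engine.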

Issue History
Date Modified | Username | Field | Change
2015-05-29 19:04 | Harriet Bazley | New Issue |
2015-06-01 14:48 | Vincent Sanders | Status | new => acknowledged
2015-06-01 14:48 | Vincent Sanders | Steps to Reproduce Updated |
2015-06-01 14:48 | Vincent Sanders | Product Version | => 3.4
2016-02-10 20:07 | Dave Higton | Note Added: 0001216 |
2016-02-10 20:09 | Dave Higton | Note Edited: 0001216 |
2016-02-10 20:10 | Dave Higton | Note Edited: 0001216 |
2016-02-11 22:32 | Dave Higton | Note Added: 0001219 |
2016-02-11 22:32 | Dave Higton | Note Edited: 0001219 |
2016-02-11 22:33 | Dave Higton | Note Edited: 0001219 |
2016-02-12 22:27 | Dave Higton | Note Added: 0001220 |
2016-02-12 22:28 | Dave Higton | Assigned To | => Dave Higton
2016-02-12 22:28 | Dave Higton | Steps to Reproduce Updated |
2016-02-12 22:29 | Dave Higton | Status | acknowledged => assigned
2016-02-16 15:20 | Dave Higton | Note Added: 0001290 |
2016-02-28 22:24 | Dave Higton | Note Added: 0001340 |
2016-02-28 22:24 | Dave Higton | Assigned To | Dave Higton =>
2016-02-28 22:24 | Dave Higton | Status | assigned => acknowledged