0002320: Cannot search for accented characters - MantisBT

2025-07-02 01:12 BST

ID	0002320
Project	NetSurf
Category	RISC OS-specific
View Status	public
Date Submitted	2015-05-29 20:04
Last Update	2016-02-28 22:24
Reporter	Harriet Bazley
Assigned To	
Severity	minor
Reproducibility	always
Status	acknowledged
Resolution	open
Platform	ARM
OS	RISC OS
OS Version	5.19
Product Version	3.4
Target Version	
Fixed in Version	

Summary

0002320: Cannot search for accented characters

Description

The F4 search feature fails to locate any search string containing accented characters, e.g. "Gestüt" or "idée", even when such a string is displayed on the page - which severely limits the utility of this feature in a non-English context. It seems to apply both to cases where the character is represented in the source code as an HTML entity and where it is UTF-8-encoded.

Steps To Reproduce

Open http://www.gestuet-tannenhof.com/ for example.
Press F4 and type "Gestüt"; Netsurf reports "Not found"

Additional Information

I have a nasty feeling that this one involves some very complicated internal workings of Netsurf and isn't readily fixable, if at all....

Tags

No tags attached.

Fixed in CI build #

Reported in CI build #

URL of problem page

Attached Files

Relationships

Notes
~0001216 Dave Higton (developer) 2016-02-10 20:07 Last edited: 2016-02-10 20:10	I did an experiment. I found that the UTF-8 representation of lower case u umlaut is C3 BC. Going to the Chars application, I entered into NS's search box the characters with byte values C3 and BC. This allowed me to step through all the lower case u umlauts in the page; they were found. It's not very useful. All the experiment shows is that the characters are stored internally by NS in UTF-8. They are translated from the source (e.g. &uuml;) to UTF-8.
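
The experiment in this note can be reproduced outside NetSurf. The following is a minimal sketch, assuming the RISC OS local alphabet behaves like Latin-1 and that the page text is held as UTF-8 bytes (as the note concludes). The variable names are illustrative and are not NetSurf identifiers.

```python
# "ü" (U+00FC) is encoded as the two bytes C3 BC in UTF-8.
utf8_bytes = "ü".encode("utf-8")
print(utf8_bytes.hex())  # c3bc

# Assumption: page text is stored internally as UTF-8.
page_bytes = "Gestüt".encode("utf-8")

# Entering the two Latin-1 characters with byte values C3 and BC hands the
# search code exactly the UTF-8 byte sequence, with no transcoding needed.
typed_via_chars = bytes([0xC3, 0xBC])
found_with_raw_bytes = typed_via_chars in page_bytes

# Typing "ü" normally yields the single Latin-1 byte FC, which never
# appears in the UTF-8 page text.
typed_normally = "ü".encode("latin-1")
found_with_latin1 = typed_normally in page_bytes

print(found_with_raw_bytes, found_with_latin1)
```

The mismatch between the single byte FC and the pair C3 BC is why the untranscoded search string fails.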

~0001219 Dave Higton (developer) 2016-02-11 22:32 Last edited: 2016-02-11 22:33	I've made some changes to my local dev version that go some way towards the functionality. I've added a new function ro_gui_get_icon_string2() that optionally transcodes from the local alphabet to UTF-8. However, when it's called, there are a couple of problems: 1) The search string that is put back into the search window from the search history is the UTF-8 encoded version, which of course wants to get transcoded again... a sort of recursion :-) I can't (yet) see where the history is taken from, so I can't correct it. 2) In the cited page, searching for "Gestüt", it just happens that one occurrence breaks across a line, and therefore isn't found, or at least isn't highlighted when it is found - I'm not sure. It's another reminder of some very very basic flaws in the layout engine. So I'm sure it can be done, but I'm not there yet.

~0001220 Dave Higton (developer) 2016-02-12 22:27	I've got a solution to (1) above: translate back from UTF-8 to local when adding a search string to the search history. However, as for point (2) above, it's worse than I thought. None of the instances of "Gestüt" in the main text block is found. I can't see why they should not be. I need a grown-up to tell me what the search function is expected to see and what it isn't.
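
The double-transcoding problem from point (1) and the fix described in this note can be sketched as follows. This is an illustration only: `icon_to_utf8` and `utf8_to_icon` are hypothetical helpers, not the actual `ro_gui_get_icon_string2()` code, and Latin-1 stands in for the RISC OS local alphabet.

```python
LOCAL = "latin-1"  # assumption: stand-in for the RISC OS local alphabet


def icon_to_utf8(icon_bytes: bytes) -> bytes:
    """Transcode text read from the search icon from local alphabet to UTF-8."""
    return icon_bytes.decode(LOCAL).encode("utf-8")


def utf8_to_icon(utf8_bytes: bytes) -> bytes:
    """Transcode a UTF-8 search string back to the local alphabet."""
    return utf8_bytes.decode("utf-8").encode(LOCAL)


typed = "Gestüt".encode(LOCAL)   # what the user types into the icon
needle = icon_to_utf8(typed)     # what the search should look for

# Problem (1): if the history stores the UTF-8 form and that is put back
# into the icon, the next read transcodes it a second time.
double = icon_to_utf8(needle)
print(double.decode("utf-8"))    # GestÃ¼t

# Fix from this note: convert back to local before storing in the history,
# so that re-reading the icon yields the original needle again.
history_entry = utf8_to_icon(needle)
print(icon_to_utf8(history_entry) == needle)
```

Storing the history in the same alphabet as the icon keeps the transcoding step to exactly one pass per search.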

~0001290 Dave Higton (developer) 2016-02-16 15:20	Belatedly I've looked in more detail at the HTML source of the cited page. The occurrences of u umlaut are coded two ways: the top block of text contains HTML entities ("&uuml;"), the rest are ISO8859-1 characters. My experiments above indicate that the ISO8859-1 characters are transcoded internally to UTF-8, so my fix of transcoding the search string to UTF-8 finds them. However, I haven't found a way to input anything that finds the HTML entities - even putting the HTML entity into the search text box doesn't find them - so I don't know how they are encoded and/or searched for internally to NS. I don't have a method that works enough to make it worthwhile, so I'm not submitting a patch at this point.
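
For reference, the two source spellings described in this note decode to the same character. The sketch below uses Python's standard `html.unescape` as a generic stand-in for an HTML parser; it says nothing about how NetSurf's own parser or search stores the result.

```python
import html

entity_source = b"Gest&uuml;t"   # top block of the page: HTML entity
latin1_source = b"Gest\xfct"     # rest of the page: ISO 8859-1 byte FC

from_entity = html.unescape(entity_source.decode("ascii"))
from_latin1 = latin1_source.decode("iso-8859-1")

# Both spellings yield the same text and the same UTF-8 bytes.
same_text = from_entity == from_latin1
same_utf8 = from_entity.encode("utf-8") == from_latin1.encode("utf-8")

# The literal entity text is gone after decoding, so typing "&uuml;"
# into a search box would not be expected to match the decoded text.
literal_entity_found = "&uuml;" in from_entity

print(same_text, same_utf8, literal_entity_found)
```

If the decoded characters are identical, a difference in search results between the two blocks would have to come from something other than the character encoding.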

~0001340 Dave Higton (developer) 2016-02-28 22:24	This turns out to be more of a can of worms than I thought. One basic problem is that HTML entities, such as &uuml;, are each put into their own boxes, so the string that includes them does not appear anywhere contiguously and therefore cannot be searched for. The cited page includes some characters that are encoded as HTML entities and some that are UTF-8 already and therefore can be searched for. The more knowledgeable developers say that this cannot be fixed until we replace the layout engine.
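
The effect of splitting text across boxes can be illustrated with a toy model. This is a hedged sketch, not NetSurf's layout engine: `TextBox`, `search_per_box` and `search_joined` are hypothetical names, and the joined search only shows why contiguity matters; it is not a proposed fix.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str


def search_per_box(boxes, needle):
    """Return indices of boxes whose own text contains the needle."""
    return [i for i, box in enumerate(boxes) if needle in box.text]


def search_joined(boxes, needle):
    """Search the concatenated text and return the boxes the hit overlaps."""
    joined = "".join(box.text for box in boxes)
    pos = joined.find(needle)
    if pos < 0:
        return []
    end = pos + len(needle)
    hits, offset = [], 0
    for i, box in enumerate(boxes):
        start, stop = offset, offset + len(box.text)
        if start < end and stop > pos:
            hits.append(i)
        offset = stop
    return hits


# Assumed box layouts: the entity gets a box of its own, plain text does not.
entity_boxes = [TextBox("Gest"), TextBox("ü"), TextBox("t Tannenhof")]
plain_boxes = [TextBox("Gestüt Tannenhof")]

print(search_per_box(plain_boxes, "Gestüt"))   # [0]
print(search_per_box(entity_boxes, "Gestüt"))  # []
print(search_joined(entity_boxes, "Gestüt"))   # [0, 1, 2]
```

In this model a box-by-box search misses any match that straddles a box boundary, which would also cover the case of an occurrence broken across a line.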

Issue History
Date Modified	Username	Field	Change
2015-05-29 20:04	Harriet Bazley	New Issue
2015-06-01 15:48	Vincent Sanders	Status	new => acknowledged
2015-06-01 15:48	Vincent Sanders	Steps to Reproduce Updated	View Revisions
2015-06-01 15:48	Vincent Sanders	Product Version	=> 3.4
2016-02-10 20:07	Dave Higton	Note Added: 0001216
2016-02-10 20:09	Dave Higton	Note Edited: 0001216	View Revisions
2016-02-10 20:10	Dave Higton	Note Edited: 0001216	View Revisions
2016-02-11 22:32	Dave Higton	Note Added: 0001219
2016-02-11 22:32	Dave Higton	Note Edited: 0001219	View Revisions
2016-02-11 22:33	Dave Higton	Note Edited: 0001219	View Revisions
2016-02-12 22:27	Dave Higton	Note Added: 0001220
2016-02-12 22:28	Dave Higton	Assigned To	=> Dave Higton
2016-02-12 22:28	Dave Higton	Steps to Reproduce Updated	View Revisions
2016-02-12 22:29	Dave Higton	Status	acknowledged => assigned
2016-02-16 15:20	Dave Higton	Note Added: 0001290
2016-02-28 22:24	Dave Higton	Note Added: 0001340
2016-02-28 22:24	Dave Higton	Assigned To	Dave Higton =>
2016-02-28 22:24	Dave Higton	Status	assigned => acknowledged