indexOf does not match unicode string

JoseMY · July 18, 2021, 1:53pm

Hi!

I am reading a file called “Symbola605.txt” with loadStrings().
Then I call .indexOf on each line to find a word that marks the beginning of a character list.

But it seems that .indexOf does not work with unicode files, even when combined with normalize(). Neither do includes() or match().

You can see the code with the attached file here:

I will copy the code here, but remember: to see it working you have to click the URL above, since this sketch needs a file.

// I'm trying to read the file "Symbola605.txt" starting from the line that contains a word,
// but it seems indexOf does not work,
// and neither does indexOf + normalize().

var SymbolaChars;
function preload() {
  SymbolaChars = loadStrings('Symbola605.txt', getSymbolaGroup);
}

// Reports, for each of the first 40 lines, whether it contains the string "symbola"
function getSymbolaGroup() {
  var needle = "Symbola".toLowerCase().normalize();
  var out = "";
  var S = "";
  for (var f = 0; f < 40; f++) {
    S = SymbolaChars[f].toLowerCase().normalize();
    out += "[" + f + "] " + S;
    out += "<br/>";
    var n = S.indexOf(needle);
    if (n < 0) {
      out += "does not include '" + needle + "'";
    } else {
      out += "includes '" + needle + "' at " + n;
    }
    out += "<br/>";
  }
  createP(out);
}

Returns:

…
[22] symbola.ttf
does not include ‘symbola’
…
[26] symbola is not a merchandise.
does not include ‘symbola’
…
[30] symbola covers the following scripts and symbols supported by unicode:
does not include ‘symbola’
…
[38] character repertoire of symbola
does not include ‘symbola’
…
(and so on)

KumuPaul · July 19, 2021, 8:20pm

The issue is that your file is encoded in UTF-16LE. The p5.js function loadStrings uses the fetch API under the hood, and the folks at WHATWG, in their infinite wisdom, decided that UTF-8 should be good enough for everybody, so the response text is always decoded as UTF-8. You could open a feature request on GitHub, but the easiest solution would be to re-encode your file as UTF-8 (here’s a quick command line if you are on a Linux/BSD-based system: linux - Using iconv to convert from UTF-16LE to UTF-8 - Stack Overflow).
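To make the mismatch concrete, here is a minimal sketch of what probably happens to each line, and one possible workaround if re-encoding the file is not an option. The helper names encodeUtf16le, decodeUtf16leLines and loadUtf16Strings are made up for illustration and are not part of p5.js; the sample text is just two lines taken from the output above. The sketch assumes an environment with TextDecoder (any modern browser, or a recent Node.js).

```javascript
// Hypothetical helper: build the UTF-16LE bytes of a string
// (two bytes per code unit, low byte first). No BOM is added here.
function encodeUtf16le(text) {
  const bytes = new Uint8Array(text.length * 2);
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    bytes[i * 2] = unit & 0xff;     // low byte
    bytes[i * 2 + 1] = unit >> 8;   // high byte
  }
  return bytes;
}

// Hypothetical helper: decode raw UTF-16LE bytes and split them into lines.
function decodeUtf16leLines(buffer) {
  const text = new TextDecoder("utf-16le").decode(buffer);
  return text.split(/\r?\n/);
}

// Hypothetical replacement for loadStrings(): fetch the raw bytes and
// decode them ourselves instead of letting them be read as UTF-8.
async function loadUtf16Strings(url) {
  const response = await fetch(url);
  return decodeUtf16leLines(await response.arrayBuffer());
}

const bytes = encodeUtf16le("Symbola.ttf\r\nSymbola is not a merchandise.");

// Read as UTF-8, every ASCII letter is followed by a NUL character
// ("S\0y\0m\0..."). NULs are invisible when printed, so the line looks
// fine on screen but never contains the contiguous word.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);
const wrongIndex = asUtf8.toLowerCase().indexOf("symbola");

// Read as UTF-16LE, the text comes back intact and indexOf finds the word.
const lines = decodeUtf16leLines(bytes);
const rightIndex = lines[0].toLowerCase().indexOf("symbola");

console.log(wrongIndex, rightIndex, lines.length);
```

In a sketch, loadUtf16Strings('Symbola605.txt') would return a promise rather than plugging into preload(), so re-encoding the file as UTF-8 is still the simpler route.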

JoseMY · July 20, 2021, 12:00pm

Well, that’s what I ended up doing, after many attempts to use str(16) to convert from UTF-8 to UTF-16: I used Notepad to convert the file.
