indexOf does not match unicode string

Hi!

I am reading a file called “Symbola605.txt” with loadStrings().
Then I call .indexOf on each line to find a word that marks the beginning of a character list.

But it seems that .indexOf does not work with Unicode files, even when combined with normalize(). Neither does includes() or match().

You can see the code with the attached file here:

I will copy the code here, but remember: to see it working you have to click the URL above, since this sketch needs a file.

// I'm trying to read the file "Symbola605.txt" starting from the line containing a word,
// but it seems indexOf does not work,
//     and neither does indexOf + normalize().
//

var SymbolaChars,body;
function preload() {
  SymbolaChars = loadStrings('Symbola605.txt', getSymbolaGroup);
}

// Looks for the string "Symbola" in each of the first 40 lines
function getSymbolaGroup(){
	var needle="Symbola".toLowerCase().normalize();
	var out="";
	var S="";
	var f,n; // declared locally so they do not leak as globals
	for(f=0;f<40;f++){
		S=SymbolaChars[f].toLowerCase().normalize();
		out+="["+int(f)+"] "+S;
		out+="<br/>";
		n=S.indexOf(needle);
		if (n<0) {
			out+="does not include '"+needle+"'";
		} else {
			out+="includes '"+needle+"' at "+n;
		}
		out+="<br/>";
	}
	createP(out);
}

Returns:


[22] symbola.ttf
does not include ‘symbola’

[26] symbola is not a merchandise.
does not include ‘symbola’

[30] symbola covers the following scripts and symbols supported by unicode:
does not include ‘symbola’

[38] character repertoire of symbola
does not include ‘symbola’

(and so on)

The issue is that your file is encoded in UTF-16LE. The p5.js function loadStrings uses the fetch API under the hood, and the folks at WHATWG, in their infinite wisdom, decided that UTF-8 should be good enough for everybody: text read through fetch is always decoded as UTF-8, whatever encoding the file actually uses. That is why the lines look right on screen but never match your search string. You could open a feature request on GitHub, but the easiest solution would be to re-encode your file as UTF-8 (here’s a quick command line if you are on a Linux/BSD-based system: linux - Using iconv to convert from UTF-16LE to UTF-8 - Stack Overflow).
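To make the failure concrete, here is a minimal sketch in plain JavaScript (no p5.js needed). It assumes the file really is UTF-16LE; the helper names toUtf16leBytes and loadUtf16leStrings are made up for this example and are not part of p5.js. It builds the UTF-16LE bytes for a sample line, decodes them once as UTF-8 and once as UTF-16LE, and runs the same indexOf on both.

```javascript
// Hypothetical helper: the UTF-16LE bytes of a string
// (two bytes per code unit, low byte first).
function toUtf16leBytes(text) {
  const bytes = new Uint8Array(text.length * 2);
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    bytes[2 * i] = unit & 0xff;
    bytes[2 * i + 1] = unit >> 8;
  }
  return bytes;
}

const bytes = toUtf16leBytes('Symbola.ttf');

// Decoded as UTF-8, every ASCII letter is followed by an invisible NUL
// character, so the text looks fine when printed but does not match.
const asUtf8 = new TextDecoder('utf-8').decode(bytes);

// Decoded with the matching label, the original text comes back.
const asUtf16 = new TextDecoder('utf-16le').decode(bytes);

console.log(asUtf8.indexOf('Symbola'));  // -1
console.log(asUtf16.indexOf('Symbola')); // 0

// Hypothetical alternative to re-encoding the file: fetch the raw bytes
// and decode them yourself, then split into lines like loadStrings does.
async function loadUtf16leStrings(url) {
  const response = await fetch(url);
  const buffer = await response.arrayBuffer();
  return new TextDecoder('utf-16le').decode(buffer).split(/\r?\n/);
}
```

If you go the second route, you would have to call the helper yourself (for example from setup()) and wait for the promise, since preload() only knows about p5's own loaders. Re-encoding the file is still the simpler fix.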


Well, that’s what I did, after many attempts to use str(16) to convert from UTF-8 to UTF-16: I used Notepad to convert the file.