My regular expression is not working

I am trying to parse this string to find attribute assignments
<body width="500" linegap="2" paragap="8" align="justify" font="courier new">

An attribute assignment is of the format
name="value"

where the format is

  • 1 or more word characters
  • 0 or more spaces
  • = symbol
  • 0 or more spaces
  • 1 " or ' character
  • 0 or 1 minus sign
  • 1 or more word characters
  • 0 or 1 space
  • 1 or more word characters
  • 1 " or ' character

The regular expression I am using is

/\w+\s*=\s*[\'\"]-?\w+\s?\w+[\'\"]/g

My sketch is shown below, but when I run it I get this output:

============  Nbr Tokens 3   ==============

-----------------------------------------------------
Pos:  000   Length:  011    Token: |width="500"|  
Pos:  001   Length:  015    Token: |align="justify"|  
Pos:  002   Length:  018    Token: |font="courier new"|

Notice that 2 of the attribute assignments are not found. I admit I am no expert in using regular expressions, so I assume I am missing something obvious.

Any help would be appreciated.

Sketch code

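function setup() {
    createCanvas(640, 400);
    // Test string containing five attribute assignments
    let s = '<body width="500" linegap="2" paragap="8" align="justify" font="courier new">';
    // Attribute pattern: name, optional spaces, =, optional spaces, quoted value
    let rgx1 = /\w+\s*=\s*[\'\"]-?\w+\s?\w+[\'\"]/g;
    // With the g flag, match() returns an array of every match, or null if there are none
    let r = s.match(rgx1);

    // Build the report line by line; n is the index of each match in the array
    let n = 0, out = ['\n'];
    out.push(`============  Nbr Tokens ${r ? r.length : 0}   ==============`);
    out.push(s);
    out.push('-----------------------------------------------------');
    // nfs() is the p5.js number formatter, used here to zero-pad to 3 digits
    r?.forEach(e => out.push(`Pos: ${nfs(n++, 3, 0)}   Length: ${nfs(e.length, 3, 0)}    Token: |${e}|  `));
    console.log(out.join('\n'));
}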

Here’s an attempt at a fix: Waiting medusaceratops by davepagurek - p5.js Web Editor

My regex looks like this: /\w+\s*=\s*['"][^'"]*['"]/g

The main difference is between the quotes. Rather than trying to match possibly multiple words with possible spaces between them, which fails to match any value with fewer than 2 characters, I’m just matching “any number of non-quote characters” ([^'"]*), including 0 characters for an empty attribute.
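For anyone who wants to compare the two patterns outside the editor, here is a minimal sketch in plain JavaScript (no p5.js calls, so it should also run in Node or a browser console). The variable names are my own and are not taken from the linked sketch.

```javascript
// Same test string as in the question.
const s = '<body width="500" linegap="2" paragap="8" align="justify" font="courier new">';

// Original pattern: needs at least two word characters inside the quotes.
const original = /\w+\s*=\s*['"]-?\w+\s?\w+['"]/g;
// Suggested pattern: any run of non-quote characters inside the quotes, including none.
const relaxed = /\w+\s*=\s*['"][^'"]*['"]/g;

// match() returns null when nothing matches, so fall back to an empty array.
const originalMatches = s.match(original) ?? [];
const relaxedMatches = s.match(relaxed) ?? [];

console.log(originalMatches.length); // 3
console.log(relaxedMatches.length);  // 5
console.log(relaxedMatches);
```

The original pattern skips linegap="2" and paragap="8" because each value is a single character, while the relaxed one picks up all five assignments.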

Edit: also in case it helps with understanding, your regex would match all attributes in your example if you change the last two \w+s into \w*. Having two \w+s means a minimum of two word characters required within the quotes, which is why the single-digit attributes aren’t getting caught. But probably the more general approach above would help for things like three-word attributes, or ones with other characters, like <div style="height: 200px">.
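A quick way to check both points in the edit above is the following rough sketch in plain JavaScript. The name minimalFix and the test strings are just for illustration.

```javascript
const s = '<body width="500" linegap="2" paragap="8" align="justify" font="courier new">';

// The original pattern with the last two \w+ changed to \w*.
const minimalFix = /\w+\s*=\s*['"]-?\w*\s?\w*['"]/g;
// The more general pattern: any non-quote characters between the quotes.
const general = /\w+\s*=\s*['"][^'"]*['"]/g;

// All five attributes in the example are now found.
const fixedMatches = s.match(minimalFix) ?? [];
console.log(fixedMatches.length); // 5

// A value containing other characters (here ':') still defeats the \w* version,
// while the general pattern accepts it.
const styled = '<div style="height: 200px">';
const styledMinimal = styled.match(minimalFix) ?? [];
const styledGeneral = styled.match(general) ?? [];
console.log(styledMinimal.length); // 0
console.log(styledGeneral);        // [ 'style="height: 200px"' ]
```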

This is part of a bigger project and like most of the stuff I create it evolves.

When I started the project I anticipated the “value” being either

  1. a positive integer, or
  2. a word comprising letters and numbers but no spaces

so the regex was very simple. Later I realised I might need negative integers, so I incorporated the -? into the regex.

Later still I realised some string values might have a space, e.g. courier new, so I hacked at the regex, not realising it now required a minimum of 2 characters for any match.

So thank you for replying, the blinkers just fell away after reading your post. :grinning:

Your regex is very neat and a good starting point. Since I don’t want zero-length values I have changed the regex to /\w+\s*=\s*['"]-?\w+[\w ]*['"]/g as it provides a closer match to the value format I want. I just need to test it fully and incorporate it into my project.
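As a starting point for that testing, here is a small sketch in plain JavaScript. The helper name findAttributes and the extra test strings are hypothetical, made up for this example rather than taken from the project.

```javascript
// Hypothetical helper wrapping the revised pattern; not part of the original sketch.
function findAttributes(text) {
  const rgx = /\w+\s*=\s*['"]-?\w+[\w ]*['"]/g;
  // match() returns null when there are no matches, so fall back to an empty array.
  return text.match(rgx) ?? [];
}

const body = '<body width="500" linegap="2" paragap="8" align="justify" font="courier new">';
console.log(findAttributes(body).length);                    // 5

// Negative integers and values with several words are accepted.
console.log(findAttributes('<p indent="-12">'));             // [ 'indent="-12"' ]
console.log(findAttributes('<p font="times new roman">'));   // [ 'font="times new roman"' ]

// Zero-length values and values with other punctuation are rejected.
console.log(findAttributes('<p title="">'));                 // []
console.log(findAttributes('<div style="height: 200px">'));  // []
```

The value must start with a word character (after an optional minus sign), which is what rules out the empty string.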

Thanks again for a very swift response. :+1: