Web Scraping code

Hello all, I'm super new to Processing and am trying to figure out how to write a sketch that can grab numeric data from a website, specifically wind speed from the National Weather Service.

Link: (https://forecast.weather.gov/MapClick.php?lat=42.383165000000076&lon=-71.10176388799994#.XQBQ9hZKipr)

I found a helpful video that I've been following along with, and so far it works pretty well. However, when I split my page and the resulting bits, I get extra information that I don't want. (Ideally I simply want a number to work with, so I can use it later on in an Arduino project.)

For example, when i run the sketch, the console gives me something like “

N 12”. How do I get rid of the , and the N? Also, sometimes there's added info after the number (if the wind speed is fast enough to create a gale, etc., it might give me something like “N 29 G 29”).

Hopefully I'm just missing something super easy. My code is below:

String[] ln = loadStrings("https://forecast.weather.gov/MapClick.php?lat=42.383165000000076&lon=-71.10176388799994#.XP2ViRZKipo");
String page1 = "";
for (int i=0;i<ln.length;i++) {
  //println(i+" "+ln[i]);
  page1 += ln[i];  // append each line of the page to one long string
}

String[] bits = split(page1, "<td class=\"text-right\"><b>Wind Speed</b></td>");

//String[] bits1 = split(bits[0], "<td>");

//println(bits.length);

String[] bits2 = split(bits[1]," mph");

//println(bits2[0]);

println(trim(bits2[0]));

Thanks in advance!!!
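For the string route in the question, one option is a regular expression that picks out the numbers and ignores whatever surrounds them. Below is a minimal sketch in plain Java (it could go in a .java tab of the sketch). The class and method names are made up for illustration, and it assumes the first number in the text is the wind speed and a number following “G” is the second reading. If leftover markup contains digits, those would be matched first, so treat this as a starting point only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Processing.
class WindText {
  // One run of digits, optionally followed by "G" and a second run of digits.
  private static final Pattern WIND =
      Pattern.compile("(\\d+)(?:\\s*G\\s*(\\d+))?");

  // First number found in the text, or -1 if there is none.
  static int speed(String raw) {
    Matcher m = WIND.matcher(raw);
    return m.find() ? Integer.parseInt(m.group(1)) : -1;
  }

  // Number after "G", or -1 if the text has no "G <number>" part.
  static int afterG(String raw) {
    Matcher m = WIND.matcher(raw);
    return (m.find() && m.group(2) != null) ? Integer.parseInt(m.group(2)) : -1;
  }
}
```

With this, `WindText.speed(trim(bits2[0]))` would replace the final `println` argument in the sketch above, and direction letters such as “N” never reach the Arduino side.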

String page1 = "";
for (int i=0;i<ln.length;i++) {
  //println(i+" "+ln[i]);
    page1 = page1+=ln[i];
}

FYI, this can be replaced with String page1 = join(ln, "");

I looked at your website. If you scroll down until you see the map in the left banner, you will notice there are two buttons right under the map: kml and xml. If you click on the xml button, you will see that it is the same URL with an extra parameter: “&FcstType=dwml”

In short, you get an XML version. After a few iterations I got the snippet below.

I used an xml beautifier to make sense of the data. You can always try to hit the xml link and then right click on the page and click on View page source and your browser should be able to render the xml nicely for you.

Note there is not much info about the XML functions on the Processing site. If you have code auto-completion working in your PDE, it will come in handy this time.

By the way, check if the info I extracted contains the field of interest. You might need to explore the values of fields 13 and 15. Not sure if you will get Gale winds. You might need to work in the other XML fields.

Kf

XML xml = loadXML("https://forecast.weather.gov/MapClick.php?lat=42.3832&lon=-71.1018&unit=0&lg=english&FcstType=dwml");
//println(xml);

//FIRST level
XML[] c = xml.getChildren("data");
println("c len",c.length);
//for (int i = 0; i < c.length; i++) {
//  println(i+"\n",c[i]);
//}

//SECOND level
XML[] cc = c[1].getChildren("parameters");
//THIS next allowed me to identify fields 11,13 and 15 as fields of interest
//printArray(cc[0].listChildren());

printArray(cc[0].getChild(11));  //direction - see example at the end
printArray(cc[0].getChild(13));  //wind-speed (gust) - see example at the end
printArray(cc[0].getChild(15));  //wind-speed (sustained) - see example at the end


//THIRD level
XML ccc=cc[0].getChild(11);
println("type:",ccc.getString("type"));
println("units:",ccc.getString("units"));
println("value:",ccc.getInt("value"));


println("Done");


////Example of positions 11,13,15
//
//<direction xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" time-layout="k-p1h-n1-1" type="wind" units="degrees true">  <value>300</value>  </direction>
//<wind-speed xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" time-layout="k-p1h-n1-1" type="gust" units="knots">  <value>NA</value>  </wind-speed>
//<wind-speed xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" time-layout="k-p1h-n1-1" type="sustained" units="knots">  <value>9</value>  </wind-speed>
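Fixed positions such as 11, 13 and 15 can shift if the feed adds or drops a field. A possible alternative is to look the element up by its tag name and its type attribute. The sketch below does that with the DOM parser from the Java standard library rather than Processing's XML class. The class name, the method name and the dataIndex parameter are assumptions for illustration, and the search is limited to one data block because the code above reads c[1].

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Processing.
class WindLookup {
  // Trimmed text of the first <wind-speed type="..."> inside the
  // dataIndex-th <data> block, or null if no such element exists.
  static String windSpeed(String xml, int dataIndex, String type) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      NodeList blocks = doc.getElementsByTagName("data");
      if (dataIndex < 0 || dataIndex >= blocks.getLength()) return null;
      Element data = (Element) blocks.item(dataIndex);
      NodeList speeds = data.getElementsByTagName("wind-speed");
      for (int i = 0; i < speeds.getLength(); i++) {
        Element e = (Element) speeds.item(i);
        if (type.equals(e.getAttribute("type"))) {
          return e.getTextContent().trim();  // text of the nested <value>
        }
      }
      return null;
    } catch (Exception ex) {
      throw new IllegalArgumentException("could not parse XML", ex);
    }
  }
}
```

Called as `WindLookup.windSpeed(xmlText, 1, "sustained")`, this returns the text as a string, so a value such as NA still has to be handled before converting to a number.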


Thank you Kf, I really appreciate your response. I ran the sketch and it looks like what I want is “Child 15”. Unfortunately, it only returns a value of 0 because there isn't a “value” to return. Rather, it looks like the value (the wind speed) sits between the tags, as in <value>11</value>, as opposed to a quoted “11”. While I think your sketch is better than my own (more elegant for sure!!!), I'm still stuck. Any ideas how to isolate that number? Many, many thanks!!!

Hello,

This worked:

//THIRD level
XML ccc=cc[0].getChild(11);
println("type:",ccc.getString("type"));
println("units:",ccc.getString("units"));
println("value:",ccc.getContent("value")); //Changed this from getInt to getContent!

And thank you kfrajer!


Glv, Kf-

Thanks a million! This did in fact work, and your code is really elegant compared to mine. I really appreciate it a ton!!!


I’ve almost forgotten about getContent(): :man_facepalming:
Processing.org/reference/XML_getContent_.html

It is needed when a value is between 2 tags <>content</> instead of being a tag attribute: :flushed:

If it’s a number, we can be even more specific w/ getIntContent(), getFloatContent(), etc.: :nerd_face:
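The attribute-versus-content distinction can be seen in a small plain-Java sketch using the standard DOM parser (not Processing's XML class). The class and method names are hypothetical. On the sample element from this thread, asking for an attribute called value finds nothing, while the text of the value child holds the number, which matches the 0 that getInt("value") returned earlier.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Processing.
class AttrVsContent {
  private static Element root(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception ex) {
      throw new IllegalArgumentException("could not parse XML", ex);
    }
  }

  // Attribute of the root element; DOM returns "" when it is absent.
  static String attribute(String xml, String name) {
    return root(xml).getAttribute(name);
  }

  // Trimmed text between the tags of the first child with this tag name.
  static String childContent(String xml, String childTag) {
    NodeList kids = root(xml).getElementsByTagName(childTag);
    return kids.getLength() == 0 ? null : kids.item(0).getTextContent().trim();
  }
}
```

In Processing terms, the closest equivalent is probably ccc.getChild("value").getIntContent(). As far as I can tell from the XML class source, the string passed to getContent("value") is a default to fall back on, not a child name, so the call above works because the element's own text already contains the number.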


OK, that really makes sense. I'm still super new to Processing, so it really helps a ton to learn more of these techniques (e.g. getContent()).

I did this to convert it to an integer:

println(ccc.getContent("value"));
int value = int(trim(ccc.getContent("value")));
println(value);

GoToLoop's answer is a better way to do this; I read it after replying.
We are all enthusiastic to help!

glv
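One thing to watch with the conversion: the gust field in the sample XML earlier in the thread holds NA rather than a number. Below is a minimal plain-Java sketch of a conversion with an explicit fallback; the class and method names are made up for illustration.

```java
// Hypothetical helper, not part of Processing.
class WindValue {
  // Parses trimmed text as an int; returns fallback for null or
  // non-numeric text such as "NA".
  static int toInt(String content, int fallback) {
    if (content == null) return fallback;
    try {
      return Integer.parseInt(content.trim());
    } catch (NumberFormatException e) {
      return fallback;
    }
  }
}
```

Picking a fallback that cannot be a real reading, such as -1, lets the rest of the sketch tell "no data" apart from a calm 0.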


Hey all, I just want to say thanks again for the help. Super appreciated!!

Also, I was wondering: is there possibly a more complete reference list? For example, on the Processing reference page I didn't see getContent() or getIntContent(), etc. I would love to have a more thorough understanding of Processing and how to use it. Pretty much any reference is a good reference for me at this point, as I want to learn as much as possible.

I did a bit of work with XML last year and have been planning to revisit this and appreciated this topic.

A couple more on Processing page:
https://processing.org/tutorials/data/

Also search the old forum:

This was version 0.1 of some work I did last year:

It has significantly improved since then.
I made the video just in case my “live” demo at a presentation did not work.

:slight_smile:


Oh man, this is great!!!

The real-time fetching timer is what i will need in the next iteration!!! Thanks for pointing me in a good direction for learning tools!


Be careful not to keep fetching XML data from the site in each draw() cycle!
The server may shut you out if you keep asking for data. Happened to me!

I added “state” changes so I could go through different data at the click of a mouse and only loaded it ONCE!

I got a chuckle out of my comment on the YouTube video:
“Who thought parsing through XML data would be so much fun!”

glv


If you have code auto-completion in your IDE, then you are able to see the available function calls associated with different implementations. This is a second source of reference when looking for additional functions. Another source, as pointed out by GoToLoop, is the source code. This is not the best, but it is worth exploring as it is quick and could resolve some immediate questions. You are also welcome to bring in external libraries. For instance, if you find a Java solution on Stack Overflow, you should be able to use it in Processing.

Kf


Kf-

Thanks for the tip! I do have it turned on, but in all honesty, I've ignored it up to now. (I use it quite a bit when I CAD in Rhino, so I understand how powerful it is.)

Btw, just out of curiosity…

I added a recurring “fetch” aspect to my sketch. However, in one of their responses in this thread, glv said that they got banned/blocked from a website for fetching/pinging too many times, too fast. Any idea how fast is “too fast” or “too many, too fast”? In my sketch I have it set to 30 seconds. Code below for reference…

import processing.serial.*;
Serial myPort;  // Create object from Serial class


//void setup() 
//{
//  size(200,200); //make our canvas 200 x 200 pixels big
//  String portName = Serial.list()[0]; //change the 0 to a 1 or 2 etc. to match your port
//  myPort = new Serial(this, portName, 9600);
//}


int time_to_fetch;
int time_between_fetches = 30000; // Thirty seconds, in milliseconds.

void draw() {

  if ( millis() > time_to_fetch ) {
    get_data();
  }

}

void get_data() {
    
  time_to_fetch = millis() + time_between_fetches;
      
  XML xml = loadXML("https://forecast.weather.gov/MapClick.php?lat=42.3832&lon=-71.1018&unit=0&lg=english&FcstType=dwml");
  //println(xml);

  //FIRST level
  XML[] c = xml.getChildren("data");
  //println("c len",c.length);
  //for (int i = 0; i < c.length; i++) {
  //  println(i+"\n",c[i]);
  //}

  //SECOND level
  XML[] cc = c[1].getChildren("parameters");
  //THIS next allowed me to identify fields 11,13 and 15 as fields of interest
  //printArray(cc[0].listChildren());

  //printArray(cc[0].getChild(11));  //direction - see example at the end
  //printArray(cc[0].getChild(13));  //wind-speed - see example at the end
  printArray(cc[0].getChild(15));  //wind-speed (sustained) - see example at the end


  //THIRD level
  XML ccc=cc[0].getChild(15);
  println("type:", ccc.getString("type"));
  println("units:", ccc.getString("units"));
  println("value:", ccc.getContent("value"));

  println("Done");


  ////Example of positions 11,13,15
  //
  //<direction xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" time-layout="k-p1h-n1-1" type="wind" units="degrees true">  <value>300</value>  </direction>
  //<wind-speed xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" time-layout="k-p1h-n1-1" type="gust" units="knots">  <value>NA</value>  </wind-speed>
  //<wind-speed xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" time-layout="k-p1h-n1-1" type="sustained" units="knots">  <value>9</value>  </wind-speed>
}
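One gap in a timer like this is what happens when a fetch fails. A possible refinement is to keep the last good reading and throttle failed attempts as well. The plain-Java sketch below shows the idea; the class name, the fetcher (assumed to return null on failure) and the fallback value are all hypothetical, and the time is passed in, so that millis() can supply it in a real sketch.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Processing.
class CachedReading {
  private final long minIntervalMs;         // shortest allowed gap between fetches
  private final Supplier<Integer> fetcher;  // assumed to return null when a fetch fails
  private final int fallback;               // returned until a fetch has succeeded
  private long nextFetchAt = 0;
  private Integer lastGood = null;

  CachedReading(long minIntervalMs, Supplier<Integer> fetcher, int fallback) {
    this.minIntervalMs = minIntervalMs;
    this.fetcher = fetcher;
    this.fallback = fallback;
  }

  // Returns the cached reading, fetching again only once the interval has passed.
  int get(long nowMs) {
    if (nowMs >= nextFetchAt) {
      nextFetchAt = nowMs + minIntervalMs;  // schedule first, so failures wait too
      Integer fresh = fetcher.get();
      if (fresh != null) lastGood = fresh;  // keep the old value on failure
    }
    return lastGood != null ? lastGood : fallback;
  }
}
```

In draw(), something like `int knots = reading.get(millis());` could then be called every frame, and the server would still only be contacted once per interval.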


This depends on the service you are tapping into. For example, Twitter has rules of conduct for programmers which place limits on the number of requests per second (req/s) and daily quotas for requests. These limits are modest, and most people should not hit them unless they are pushing it. You need to know about these limits and design your application to take them into account. Note these limits depend on each site, and hopefully the site makes them available to programmers (it should, as this encourages best practices). If you cannot find limits and quotas for a service, you can email them.

As a developer/coder, the key point is to implement best practices in your code. Your code snippet is an excellent demonstration of a proper way to do it. In the end, it is very unlikely you want to send requests at the same rate as the refresh rate of your sketch. If that were the case, there are other techniques, which fall under the domain of streaming.

Kf


This topic is also further discussed here:

:slight_smile:


Hi everyone,
I know this is an old topic, but I wanted to add something that might be useful to anyone trying to do web scraping with Processing 3: you can use the JSoup library (https://jsoup.org/). It is quite easy to install (you only have to download a JAR file and put it in a code directory in your sketchbook), the tutorials are short and clear, and you can access all the elements you want in your HTML page and manipulate them with some convenient data structures. I personally prefer it to the default XML objects you can use in Processing!
