Parsing HTML with Jsoup

I am attempting to use the Jsoup library to parse HTML but the most basic code does not work.

First, the importer generates this:

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.internal.*;
import org.jsoup.parser.*;
import org.jsoup.safety.*;
import org.jsoup.select.*;
import org.jsoup.helper.*;

The code is as follows:

    String url = "https://en.wikipedia.org/wiki/Main_Page";
    Document doc = Jsoup.connect(url).get();
    print(doc.title());
    Elements newsHeadlines = doc.select("#mp-itn b a");
    for (Element headline : newsHeadlines) {
    	print("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
    }

Error: “Unhandled exception type IOException”.

I have tried this with other URLs, including plain HTTP.

Has anyone used this library successfully, or can anyone recommend an alternative?


OK, so the following does work. But my peace of mind is shattered: why would a try block make the IO error simply vanish? Surely, if the error was there before, it should still be reported now?

	String url = "https://en.wikipedia.org/wiki/Main_Page";
	Document doc;

	try {
		doc = Jsoup.connect(url).get();
	} catch (IOException e) {
		e.printStackTrace();
		doc = null;
	}

	if (doc != null) {
		print(doc.title());
		Elements newsHeadlines = doc.select("#mp-itn b a");
		for (Element headline : newsHeadlines) {
			print("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
		}
	}

You are misunderstanding.

The error you saw is a compile-time error, not a run-time error. Jsoup's get() method declares that it may throw IOException, so any code that calls it must handle that exception; otherwise the code is invalid and will not compile. Adding a try-catch makes the code valid, so the compile-time error goes away. You see no run-time error because there never was one. There could be one, though (for example, if the network is down), and now your code would handle it as required.
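To make the compile-time versus run-time distinction concrete, here is a minimal sketch. The class CheckedDemo and the method fetch are hypothetical stand-ins for a network call such as Jsoup.connect(url).get(); they are not part of Jsoup. The catch block is required for the code to compile, but it only runs if the exception is actually thrown.

```java
import java.io.IOException;

public class CheckedDemo {
    // Hypothetical stand-in for a call that declares a checked exception.
    static String fetch(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated network failure");
        }
        return "page contents";
    }

    // Without the try-catch, this method would not compile,
    // whether or not fetch ever fails at run time.
    static String fetchOrNull(boolean fail) {
        try {
            return fetch(fail);
        } catch (IOException e) {
            // Reached only when fetch really throws.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchOrNull(false)); // prints "page contents"
        System.out.println(fetchOrNull(true));  // prints "null"
    }
}
```

In the first call nothing goes wrong, so the catch block is skipped entirely, which is what happened with the Wikipedia request.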

Java has a feature called “checked exceptions”. These are the exceptions that subclass Exception but not RuntimeException. If a method may throw one of them, it must list it in its throws declaration, for example: void readData() throws IOException. IOException is a checked exception. Thus, when you call a method that lists IOException in its throws declaration, you must either catch it or list it in your own throws declaration.

See also: Why do I get the "Unhandled exception type IOException"? (Stack Overflow)
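The two options can be sketched side by side. This is an illustration only: the class ThrowsDemo and its methods are made-up names, and it uses the standard BufferedReader.readLine(), which declares throws IOException, in place of a Jsoup call so that it runs without any extra library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ThrowsDemo {
    // Option 1: list the exception in your own throws declaration
    // and let the caller decide what to do about it.
    static String firstLine(String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine(); // readLine() declares throws IOException
        }
    }

    // Option 2: catch the exception here, so callers need no handling.
    static String firstLineOrDefault(String text, String fallback) {
        try {
            String line = firstLine(text);
            return line != null ? line : fallback; // null means there was no line to read
        } catch (IOException e) {
            return fallback;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstLine("first\nsecond"));
        System.out.println(firstLineOrDefault("", "nothing to read"));
    }
}
```

Which option fits depends on where a failure can be dealt with sensibly: catch where you can recover or report, declare throws where only the caller knows what to do.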


Thanks. I am unfamiliar with Java but your explanation helps.