Can I use Processing to search a site for very specific results?

Teljemo · December 11, 2021, 7:21am

Hi,

I collect old books about industry, technology, construction, engineering and so on.
Mostly books in my native language, Swedish.
We have a great site where people sell and buy books but there is no way to search for when the book was released or what edition a book has.

I’m wondering if there is a way to make a small program that looks through the books on this site, and if it’s possible to tell the program to look at a specific place in every sale ad for a number within a specific span (like 1800-1900).
I would also like to be able to look for editions, that is, to search for a specific word at a specific place in every sale ad.

The site is: www.bokborsen.se
I’ve talked to them but they don’t have a solution at the moment, other than using the free-text search and going through every year.
The problem is that many books have a year in the title but the book itself is newer.
To limit the search there is a category where I would like to look.
Category “Teknik”, which means Technology in Swedish.
I can find two different ways of viewing this category on the site.
When I select the category from the menu on the side, the address shows: www.bokborsen.se/genre/65/teknik
If I choose the same category from the drop-down menu in the site’s search tool, the address shows: www.bokborsen.se/?g=65&c=0&q=&qa=&qt=&qi=&qs=&f=1&fi=&fd=&fs=&pb=&_s=created_at&_d=desc

The info about when the book was printed and what edition it is is most often stated in the first line under the book title or author.
So, if there is a way to filter my search on this line of text I could refine the search very much!

If this is possible, there must be some kind of result from the search? A text file copying the sale ad info for every result so I can look it up? Links to every result? An HTML file with the resulting sale ads lined up with image and all? Maybe an Excel file?

Well… Maybe this already exists? I tried to use Google’s advanced site search but with bad results. I couldn’t tell Google not to look in the headline or title of each sale ad.

Here is an image of a search.
I’ve selected “Teknik” (Green marker) as category and typed in “1920” (Yellow marker) in the free text search field.
The result I like is the one in the red circle. The one above is not what I was looking for.
I only want the ones where the year is in the info text and not in the title.

Just checking what you think about my question to begin with…
If you say there is a chance of making this, I might need some push in the right direction with the code.

Thank you!

josephh · December 11, 2021, 12:39pm

Hi @Teljemo,

Since the website you provided doesn’t seem to have an accessible API that you can query directly, what you are looking for is web scraping.

Web scraping is the process of gathering information by parsing and extracting data directly from an HTML page.

The steps might be:

Find a DOM parsing library (check this in Python or Node.JS)
Try to understand how the URL of the website works when searching a category and a year, so you can craft a URL for that search
Focus on the data you want to get and how it looks in HTML: if the info text has a specific CSS class or id, you can detect it easily and get the innerHTML text

Finally, I wouldn’t use Processing for that because you are not going to display or draw anything on the screen, and Node.JS or Python are easier to deal with than Java for this kind of task.

Have fun!

hotfooted · December 11, 2021, 2:09pm

you could use jsoup

and do something like

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.internal.*;
import org.jsoup.parser.*;
import org.jsoup.safety.*;
import org.jsoup.select.*;
import org.jsoup.helper.*;

void setup() {
  int pageId = 1;
  String date = "1920";
  while (true) {
    String url = "https://www.bokborsen.se/?_p=" + pageId + "&f=1&g=65&q=" + date;
    Document doc;

    try {
      doc = Jsoup.connect(url).get();
    }
    catch (IOException e) {
      e.printStackTrace();
      doc = null;
    }

    if (doc != null) {
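      // Note: this prints the number of the NEXT page, so page 1's links appear under "pageId: 2".
      println("pageId: " + ++pageId);
      
      Elements bookListContents = doc.getElementsByClass("content-primary");
      // Assumption: a page past the last result page has no book entries, so stop there
      // instead of requesting pages forever.
      if (bookListContents.isEmpty()) {
        break;
      }
      for(Element content : bookListContents) {
        String info = content.select("p").text();
        if(info.contains(date)) {
          Element link = content.select("a[href]").first();
          if (link != null) {
            println(link.attr("abs:href"));
            println();
          }
        }
      }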
    }
    else {
      break;
    }
  }
}

which will iterate over the result pages and print the urls of those books whose info text contains the given date, something like

pageId: 2
https://www.bokborsen.se/view/-/Vetenskapen-Och-Livet-Årgång-6-1921/10907834
https://www.bokborsen.se/view/Bröderna-Rosenlöd-Utgivare/Praktiska-Tabeller-Utvisande-Kubikinnehållet-Af-R/10902850
https://www.bokborsen.se/view/E-Andreen-H-Holst/Elektriciteten-De-Elektriska-Krafternas-Alstring-/10006093
…

obv i can’t read the language but from what i could see this works and with further tweaking should be able to be tailored to fit your needs.
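The original question asked for a span of years such as 1800-1900 rather than a single year. Here is a minimal sketch of that check in plain Java, which could stand in for the `info.contains(date)` test above. The class name `YearRangeFilter` and the method `containsYearInRange` are made up for illustration, and it assumes the year appears in the info line as a standalone four-digit number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or Processing.
class YearRangeFilter {
  // A four-digit number that is not part of a longer run of digits.
  private static final Pattern YEAR = Pattern.compile("(?<!\\d)\\d{4}(?!\\d)");

  // Returns true if the text contains at least one year between from and to (inclusive).
  static boolean containsYearInRange(String text, int from, int to) {
    Matcher m = YEAR.matcher(text);
    while (m.find()) {
      int year = Integer.parseInt(m.group());
      if (year >= from && year <= to) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // Info lines in the style shown on the site's result pages.
    String a = "Josef Bergendahls boktryckeri. 1920. 372 sidor.";
    String b = "Svenska tryckeriaktiebolaget. 1920. 1.760, 761-1360 sidor.";
    System.out.println(containsYearInRange(a, 1900, 1950)); // true
    System.out.println(containsYearInRange(b, 1800, 1900)); // false
  }
}
```

One caveat: any four-digit number counts, so a page count such as "1360 sidor" would match if it happened to fall inside the chosen span. In a Processing sketch the method can be pasted in as a sketch-level function.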

GoToLoop · December 11, 2021, 9:59pm

This is my attempt to parse that site using parseXML():

parseXML() can’t parse raw HTML unless we remove lotsa syntax stuff it doesn’t accept.

After filling up my function preXmlParseFixes() w/ lotsa replace() & replaceAll(), parseXML() was finally able to successfully parse what was left of it.

It’s not perfect but I hope it’s a pretty good start for you:

/**
 * XML Site Parser/Scraper (v1.1.1)
 * GoToLoop (2021/Dec/11)
 *
 * https://Discourse.Processing.org/t/
 * can-i-use-processing-to-search-a-site-for-very-specific-results/34052/4
 *
 * https://Bokborsen.se?_p=1&c=0&f=1&g=65&q=1920
 *
 * body div#contentwrapper div#content-row div#main-content ul.list-books
 * li.book div.content div.content-primary div.header p/h2
 */

static final String
  HTTP = "https://", 
  SITE = "Bokborsen.se", 

  QUERY = "?q=1920", // 1920
  SECTION = "&f=1", // Böcker
  CATEGORY = "&g=65", // Teknik
  FORMAT = "&c=0", // Alla Böcker
  PAGE = "&_p=1", // Page 1

  URL = HTTP + SITE + QUERY + SECTION + CATEGORY + FORMAT + PAGE, 

  YEAR = 'y' + nf(int(QUERY.substring(3)), 4), // y1920
  NUM = 'p' + nf(int(PAGE.substring(4)), 4), // p0001
  INFO = YEAR + '.' + NUM, // y1920.p0001

  PARSABLE = "parsable.", 
  ENTRIES = "entries.", 
  EXT = ".xml", 

  LINKS = "links.", 
  EXT2= ".txt", 

  OPEN_TAG = "<ul class=\"list list-books\"", 
  CLOSE_TAG = "</ul>";

void setup() {
  println(URL);

  final String lines = join(loadStrings(URL), ENTER);

  final int
    start = lines.indexOf(OPEN_TAG), 
    end = lines.indexOf(CLOSE_TAG, start) + CLOSE_TAG.length();

  println(lines.length(), start, end, end - start);

  final String txt = preXmlParseFixes(lines.substring(start, end));
  saveStrings(dataPath(PARSABLE + INFO + EXT), split(txt, ENTER));

  final XML ul = parseXML(txt), liTags[] = ul.getChildren("li");
  final String query = QUERY.substring(3);

  deleteAllChildren(ul);

  for (final XML li : liTags)  if (
    xmlContainsQueryInTag(li, query, "p") &&
    !xmlContainsQueryInTag(li, query, "h2"))  ul.addChild(li);

  print("Found", ul.getChildCount(), "entries out of", liTags.length);
  println(" using query '" + query + "'.");
  saveXML(ul, dataPath(ENTRIES + INFO + EXT));

  saveStrings(dataPath(LINKS + INFO + EXT2), getEntryLinks(ul));
  exit();
}

static final boolean xmlContainsQueryInTag(
  final XML xml, final String qry, final String tag)
{
  final XML // li.book div.content div.content-primary div.header {tag}
    divContent = xml.getChild("div"), 
    divPrimary = findXmlChildWithClassName(divContent, "content-primary"), 
    divHeader = divPrimary.getChild("div"), 
    child = divHeader.getChild(tag);

  deleteAllChildren(divPrimary).addChild(divHeader);
  deleteAllChildren(divContent).addChild(divPrimary);
  deleteAllChildren(xml).addChild(divContent);

  return child.getContent().contains(qry);
}

static final String[] getEntryLinks(final XML xml) {
  final StringList links = new StringList();

  for (final XML li : xml.getChildren("li")) {
    final XML h2a = getChildDivHeader(li).getChild("h2/a");
    final String href = h2a.getString("href"); // h2 a[href]

    links.append(HTTP + SITE + href);
  }

  return links.array();
}

static final XML getChildDivHeader(final XML xml) {
  return // li.book div.content div.content-primary
    findXmlChildWithClassName(xml.getChild("div"), "content-primary")
    .getChild("div"); // div.header
}

static final XML findXmlChildWithClassName(final XML xml, final String name) {
  for (final XML child : xml.getChildren())
    if (child.getString("class", "").contains(name))  return child;
  return null;
}

static final XML deleteAllChildren(final XML xml) {
  for (final XML child : xml.getChildren())  xml.removeChild(child);
  return xml;
}

static final String preXmlParseFixes(final String xmlTxt) {
  return xmlTxt
    .replace("&Auml;", "Ä")
    .replace("&auml;", "ä")
    .replace("&Aring;", "Å")
    .replace("&aring;", "å")
    .replace("&eacute;", "é")
    .replace("&Ouml;", "Ö")
    .replace("&ouml;", "ö")
    .replace("&Uuml;", "Ü")
    .replace("&uuml;", "ü")

    .replace("&rdquo;", "”")
    .replace("&hellip;", "\u2026")

    .replace(" itemscope", "")
    .replace(" data-nav-store-panel", "")
    .replace(" data-show-large-product-image", "")
    .replace(" data-container-description", "")
    .replace(" data-control-expand-description", "")
    .replace(" data-control-filter-author", "")
    .replace(" data-control-clickcard", "")
    .replaceAll(" data-clickcard[-]?\\d*", "")

    .replace("<span-expanded class=\"hidden\">", "")
    .replace("</nav>", "")
    .replace("&qs=", "")
    .replaceAll("<[/]?nobr>", "")

    .replaceAll("\\s{24}</span>\\n\\s{44}</span>", "</span>");
}

josephh · December 11, 2021, 10:45pm

For the sake of showing different approaches, here is a Node.JS implementation:

Setup:

$ mkdir bokborsen_scrap
$ cd bokborsen_scrap
$ npm init -y
$ npm i axios jsdom

Code:

// index.js

const axios = require("axios")
const { JSDOM } = require("jsdom")

const baseURL = "https://www.bokborsen.se"
const category = "65"
const query = "1920"
const url = `${baseURL}/?g=${category}&q=${query}`

axios.get(url).then(response => {
  const dom = new JSDOM(response.data)

  const products = dom.window.document.getElementsByClassName("single-product")

  for (const product of products) {
    const header = product.getElementsByClassName("header")[0]

    const title = header.querySelector("[itemprop='name']")
    const titleContent = title.innerHTML.trim()

    const linkToBook = title.parentNode.href
    
    const infoText = header.getElementsByTagName("p")[0]
    const infoTextContent = infoText.innerHTML.replace(/<strong>.*<\/strong>/g, "").trim()

    if (infoTextContent.includes(query)) {
      console.log("BOOK FOUND!")
      console.log(`- ${titleContent}\n`)
      console.log(`-> ${baseURL}${linkToBook}`)
      console.log(infoTextContent)
      console.log("-----------------------")
    }
  }
})

Which gives:

$ node index.js

BOOK FOUND!
- Vetenskapen och livet: årgång 6, 1921

-> https://www.bokborsen.se/view/-/Vetenskapen-Och-Livet-%C3%85rg%C3%A5ng-6-1921/10907834
Hugo Gebers förlag. 1920.
-----------------------
BOOK FOUND!
- Praktiska tabeller utvisande kubikinnehållet...

-> https://www.bokborsen.se/view/Br%C3%B6derna-Rosenl%C3%B6d-Utgivare/Praktiska-Tabeller-Utvisande-Kubikinneh%C3%A5llet-Af-R/10902850
Kungsgården. 4 uppl. 1920. 28 sidor.
-----------------------
BOOK FOUND!
- Elektriciteten. De elektriska krafternas alst...

-> https://www.bokborsen.se/view/E-Andreen-H-Holst/Elektriciteten-De-Elektriska-Krafternas-Alstring-/10006093
Fröléen &amp; Comp. Stockholm. 1920. 760 sidor.
-----------------------
BOOK FOUND!
- Elektriciteten. de elektriska krafternas alst...

-> https://www.bokborsen.se/view/E-AndreenH-Holst/Elektriciteten-De-Elektriska-Krafternas-Alstring-/4068315
Fröleén &amp; comp. 1920. 760 sidor.
-----------------------
BOOK FOUND!
- Sonessons hufvud-katalog. Band B innehållande...

-> https://www.bokborsen.se/view/Wilh-Sonesson-Co/Sonessons-Hufvud-Katalog-Band-B-Inneh%C3%A5llande-Afd/10123136
Wilh. Sonesson &amp; Co. 1:a uppl. 1920. 288 sidor.
-----------------------
BOOK FOUND!
- GENERAL CATALOGUE OF OPENING, PREPARING, SPIN...

-> https://www.bokborsen.se/view/-/General-Catalogue-Of-Opening-Preparing-Spinning-/10060926
Tweedales &amp; Smalley (1920) Ltd. 343 sidor.
-----------------------
BOOK FOUND!
- Handbok för stålbehandlare

-> https://www.bokborsen.se/view/Gustav-Hallin/Handbok-F%C3%B6r-St%C3%A5lbehandlare/9900434
Josef Bergendahls boktryckeri. 1920. 372 sidor.
-----------------------
BOOK FOUND!
- Davids rådgivare i fotografi

-> https://www.bokborsen.se/view/Ludwig-David/Davids-R%C3%A5dgivare-I-Fotografi/9730410
AB Chelius &amp; Co. 1920. 183 sidor.
-----------------------
BOOK FOUND!
- Världsindustrien - Teknikens segertåg i ord o...

-> https://www.bokborsen.se/view/Max-Geitel/V%C3%A4rldsindustrien-Teknikens-Segert%C3%A5g-I-Ord-Och-/9484257
Nordiska bokförlaget. 1920. 1360 sidor.
-----------------------
BOOK FOUND!
- Världsindustrien Del I och II. Teknikens sege...

-> https://www.bokborsen.se/view/Max-Geitel/V%C3%A4rldsindustrien-Del-I-Och-Ii-Teknikens-Segert%C3%A5/9406506
Svenska tryckeriaktiebolaget. 1920. 1.760, 761-1360 sidor.
-----------------------
BOOK FOUND!
- Elektricitet

-> https://www.bokborsen.se/view/Andreen-E-Holst-H/Elektricitet/9199029
Fröléen. 3. genomsedda och utökade upplagan uppl. 1920. 760 sidor.
-----------------------

....

It’s not perfect but it checks that the query (1920) is in the book description.

Teljemo · December 14, 2021, 6:29am

Wow, Awesome!
Thank you guys!

I tried (to try) all 3 examples above with the following results:

@hotfooted’s example using jsoup is the one I got working the best right away.
I downloaded the small jsoup file and put in the program folder and all seems to work.
However, I am wondering what “pageId” stands for?
The links show up after the headline “pageId: 2” in the console.
Then pageId 3 and 4…
After pageId: 5 no more links show up and the program seems to just continue.
Are these the result pages from “bokborsen.se”?

@GoToLoop’s program gives me a teaser in the console with the link of the search.
Then a line with numbers I don’t get.
Finally a line telling me how many books were found.
At first I didn’t get it but then noticed that the code exported .xml files.
What would you say is the best way of viewing those xml files? In Excel I see all the data but it’s almost too much. In Word I only get text info, no links or way of viewing images.
Really nice getting a file however!

@josephh’s program using Node.JS I never got to work.
I downloaded Node.JS, followed the installation instructions and pasted the code into Visual Studio Code.
I selected Run and “Start debugging” and used Node.JS.
The result is: “Uncaught Error: Cannot find module 'axios'”
I haven’t used Visual Studio Code much, mostly for editing 3D printing firmware, so I guess I’m missing a lot of basic add-ons?
What’s the next step to get the program running? I would love to try it out and get the result you show in your post!

josephh · December 14, 2021, 7:24am

This is because you missed the install part. Node.JS uses NPM which is a package manager and you need to explicitly install your dependencies (in our case axios and jsdom).

Do that in your project directory (on a terminal, you can open one in VSCode):

$ npm init -y # Initialize your project with npm
$ npm install axios jsdom # Install the dependencies

GoToLoop · December 14, 2021, 2:39pm

This is the println() code statement:
println(lines.length(), start, end, end - start);

Variable lines is the String containing the loaded webpage:
final String lines = join(loadStrings(URL), ENTER);

Method length() gets the total number of characters of the String.

Variable start holds the String’s index where this substring is found:
OPEN_TAG = "<ul class=\"list list-books\""
start = lines.indexOf(OPEN_TAG)

And variable end holds the last index for it:
CLOSE_TAG = "</ul>"
end = lines.indexOf(CLOSE_TAG, start) + CLOSE_TAG.length()

The tag <ul> stores an unordered list of <li> tags.

The webpage contains more than 1 <ul> tag though.

But the 1 containing the <li> tags for the books belongs to classes “list” & “list-books”.

That’s why the search substring is <ul class="list list-books".

The sketch extracts that <ul> opening tag until its </ul> closing tag and discards the rest using the method String::indexOf() to find both the start & end indices then String::substring() to cut off that section outta the webpage String:
final String txt = preXmlParseFixes(lines.substring(start, end));

The current total number of characters after that is end - start.
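To make the slicing step concrete, here is a small stand-alone Java sketch of the same indexOf()/substring() idea on a made-up miniature page. The class `SectionSlicer` and its `slice` method are hypothetical names. Unlike the sketch above, it returns null when a tag is missing, and it assumes the book list contains no nested </ul>.

```java
// Hypothetical stand-alone demo of the indexOf()/substring() slicing described above.
class SectionSlicer {
  static final String OPEN_TAG = "<ul class=\"list list-books\"";
  static final String CLOSE_TAG = "</ul>";

  // Returns the text from OPEN_TAG through the first CLOSE_TAG after it,
  // or null if either tag is missing.
  static String slice(String page) {
    int start = page.indexOf(OPEN_TAG);
    if (start < 0) return null;
    int close = page.indexOf(CLOSE_TAG, start);
    if (close < 0) return null;
    int end = close + CLOSE_TAG.length();
    return page.substring(start, end);
  }

  public static void main(String[] args) {
    // Made-up page with two <ul> tags; only the second one is the book list.
    String page = "<ul class=\"menu\"><li>Meny</li></ul>"
      + "<ul class=\"list list-books\"><li class=\"book\">Bok</li></ul><p>footer</p>";
    // prints: <ul class="list list-books"><li class="book">Bok</li></ul>
    System.out.println(slice(page));
  }
}
```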

Finally the sketch saves that <ul></ul> tag section as an XML file: “parsable.y1920.p0001.xml”
saveStrings(dataPath(PARSABLE + INFO + EXT), split(txt, ENTER));

I’m just using a browser for it. Even Notepad can open an XML file.

After the 1st saveStrings() I use that same txt variable as the argument for parseXML() in order to get an XML object from it:
final XML ul = parseXML(txt), liTags[] = ul.getChildren("li");

While the String “1920” is searched within that XML object I ended up deleting all the other children tags using my function deleteAllChildren() throughout the code:

deleteAllChildren(divPrimary).addChild(divHeader);
deleteAllChildren(divContent).addChild(divPrimary);
deleteAllChildren(xml).addChild(divContent);

static final XML deleteAllChildren(final XML xml) {
  for (final XML child : xml.getChildren())  xml.removeChild(child);
  return xml;
}

So not much is left after that of course!

However, there are still “hidden” links throughout the final saved file: “entries.y1920.p0001.xml”
saveXML(ul, dataPath(ENTRIES + INFO + EXT));

But they’re all incomplete links like this 1:

<h2>
  <a href="/view/Br%C3%B6derna-Rosenl%C3%B6d-Utgivare/Praktiska-Tabeller-Utvisande-Kubikinneh%C3%A5llet-Af-R/10902850">
    <span
      data-expanded-value="Praktiska tabeller utvisande kubikinnehållet af rundt virke, plank och bräder"
      itemprop="name">Praktiska tabeller utvisande kubikinnehållet...
    </span>
  </a>
</h2>

Can you spot the link "/view/Br%C3%B6derna-Rosenl%C3%B6d-Utgivare/Praktiska-Tabeller-Utvisande-Kubikinneh%C3%A5llet-Af-R/10902850" there?

If you concatenate it w/ "https://Bokborsen.se" you’ll get a complete valid link:
https://Bokborsen.se/view/Bröderna-Rosenlöd-Utgivare/Praktiska-Tabeller-Utvisande-Kubikinnehållet-Af-R/10902850
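For reference, the same joining and decoding can be done with the JDK’s own classes instead of plain concatenation. This is only a sketch: the class `LinkDemo` and its method names are made up, and the `URLDecoder.decode(String, Charset)` overload used here needs Java 10 or newer, so it may not be available in older Processing versions.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; only URI and URLDecoder are real JDK APIs.
class LinkDemo {
  static final URI BASE = URI.create("https://Bokborsen.se");

  // Turns a site-relative href such as "/view/..." into a complete link.
  static String absolute(String href) {
    return BASE.resolve(href).toString();
  }

  // Converts percent-encoded UTF-8 bytes (for example %C3%B6) back into readable characters.
  static String readable(String link) {
    return URLDecoder.decode(link, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String href = "/view/Gustav-Hallin/Handbok-F%C3%B6r-St%C3%A5lbehandlare/9900434";
    String link = absolute(href);
    System.out.println(link);           // the complete, still percent-encoded link
    System.out.println(readable(link)); // the same link with Swedish letters restored
  }
}
```

Both forms point to the same page. The percent-encoded one is the safer choice to save in a text file, while the decoded one is easier to read.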

So I’ve extended the code w/ 2 more functions: getEntryLinks() & getChildDivHeader().

I use those 2 new functions to capture all the “href” links inside the tags <h2><a href>:
saveStrings(dataPath(LINKS + INFO + EXT2), getEntryLinks(ul));

I’ve edited my previous post to include the latest “XML Site Parser (v1.1.0)” version btW.

“links.y1920.p0001.txt”:

https://Bokborsen.se/view/-/Vetenskapen-Och-Livet-%C3%85rg%C3%A5ng-6-1921/10907834
https://Bokborsen.se/view/Br%C3%B6derna-Rosenl%C3%B6d-Utgivare/Praktiska-Tabeller-Utvisande-Kubikinneh%C3%A5llet-Af-R/10902850
https://Bokborsen.se/view/E-Andreen-H-Holst/Elektriciteten-De-Elektriska-Krafternas-Alstring-/10006093
https://Bokborsen.se/view/E-AndreenH-Holst/Elektriciteten-De-Elektriska-Krafternas-Alstring-/4068315
https://Bokborsen.se/view/Wilh-Sonesson-Co/Sonessons-Hufvud-Katalog-Band-B-Inneh%C3%A5llande-Afd/10123136
https://Bokborsen.se/view/-/General-Catalogue-Of-Opening-Preparing-Spinning-/10060926
https://Bokborsen.se/view/Gustav-Hallin/Handbok-F%C3%B6r-St%C3%A5lbehandlare/9900434
https://Bokborsen.se/view/Ludwig-David/Davids-R%C3%A5dgivare-I-Fotografi/9730410
https://Bokborsen.se/view/Max-Geitel/V%C3%A4rldsindustrien-Teknikens-Segert%C3%A5g-I-Ord-Och-/9484257
https://Bokborsen.se/view/Max-Geitel/V%C3%A4rldsindustrien-Del-I-Och-Ii-Teknikens-Segert%C3%A5/9406506
https://Bokborsen.se/view/Andreen-E-Holst-H/Elektricitet/9199029
https://Bokborsen.se/view/Max-Geitel/V%C3%A4rldsindustrien-Teknikens-Segert%C3%A5g-I-Ord-Och-B/9192963
https://Bokborsen.se/view/E-Andreen-H-Holst/Elektriciteten-De-Elektriska-Krafternas-Alstring-/9076555
https://Bokborsen.se/view/Max-Geitel/V%C3%A4rldsindustrien-Teknikens-Segert%C3%A5g-I-Ord-Och-/8962598
https://Bokborsen.se/view/Paavo-Pero/Mekanisk-Teknologi/8861707
https://Bokborsen.se/view/-/Luth-Ros%C3%A9ns-Elektriska-Aktiebolag-Stockholm-Ka/8687285
https://Bokborsen.se/view/Max-Geitel/V%C3%A4rldsindustrien-Teknikens-Segert%C3%A5g-I-Ord-Och-Bi/8461699

Teljemo · December 20, 2021, 4:06pm

Hi,

@GoToLoop, Thanks for the explanation.
One thing I can’t seem to get to work is changing the search word “1920” to any other year. I’m getting a fatal error!

@josephh, I have tried installing with the lines you wrote above and something is happening, but I still can’t run your program… I guess I’m spoiled by Processing being easy to use and making it easy for beginners to add files, compared to Visual Studio Code.

GoToLoop · December 20, 2021, 4:55pm

The QUERY constant has to follow the model "?q=" plus the year, since the sketch reads the year from it w/ QUERY.substring(3).

So if you wanna search for 1930 instead the constant is exactly this way now:
QUERY = "?q=1930",

And this is the new output log for it:

https://Bokborsen.se?q=1930&f=1&g=65&c=0&_p=1
324656 122916 298438 175522
Found 14 entries out of 30 using query '1930'.
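Since the year is just the tail of the query string, the URL could also be assembled from a number, which makes it easy to step through a whole span of years like the 1800-1900 asked about at the start. A minimal sketch in plain Java follows. The class `SearchUrls` and its methods are hypothetical; the parameter names q, f, g, c and _p are the ones already used in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only assembles strings and does not contact the site.
class SearchUrls {
  static final String BASE = "https://Bokborsen.se";

  // Builds the search URL for one year and one result page of category 65 (Teknik).
  static String forYear(int year, int page) {
    return BASE + "?q=" + year + "&f=1&g=65&c=0&_p=" + page;
  }

  // Builds the first-page URL for every year from first to last, inclusive.
  static List<String> forRange(int first, int last) {
    List<String> urls = new ArrayList<>();
    for (int year = first; year <= last; year++) {
      urls.add(forYear(year, 1));
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(forYear(1930, 1)); // https://Bokborsen.se?q=1930&f=1&g=65&c=0&_p=1
    System.out.println(forRange(1800, 1900).size()); // 101
  }
}
```

Each URL would still have to be loaded and filtered as in the sketches above, ideally with a pause between requests so the site is not flooded.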

josephh · December 20, 2021, 11:52pm

If you could provide the output of the console, it would help to determine what went wrong or what is not working.
