I collect old books about industry, technology, construction, engineering and so on.
Mostly books in my native language, Swedish.
We have a great site where people sell and buy books but there is no way to search for when the book was released or what edition a book has.
I’m wondering if there is a way to make a small program and tell it to look for books on this site and if its possible to tell the program to look at a specific place on every sell add, for a specific number span. (like 1800-1900)
(I would also be able to look for editions. To search for a specific word at a specific place in every sell add).
The site is: www.bokborsen.se I’ve talked to them but they don’t have a solution at the moment, other that using the free text search ang go through every year. The problem is that many books have a year in the title but the book itself is newer.
To limit the search there is a category where I would like to look.
Category “Teknik”. Its Technology in Swedish.
I can find two different ways of viewing this category in the site. When selecting the category from a menu on the side the adress show: www.bokborsen.se/genre/65/teknik If I chose the same category from a drop down menu in the sites search tool the adress show: www.bokborsen.se/?g=65&c=0&q=&qa=&qt=&qi=&qs=&f=1&fi=&fd=&fs=&pb=&_s=created_at&_d=desc
The info about when the book was printed and what edition it is is most often stated in the first line under the book title or author.
So, if there is a way to filter my search on this line of text I could refine the search very much!
If this is possible, there must be some kind of result from the search? A text file copying the sell add info for every result so I can look it up?. Links to every result?. a html file with the resulting sell adds lined up with image and all?. Maybe a excel file?.
Well… Maybe this already exists? I tried to use google’s advanced site search but with bad results. I couldnt tell google not to look in the headline or title of each sell add.
Here are a image of a search.
Ive selected “Teknik” (Green marker) as category and typed in “1920” (Yellow marker) in the free text search field.
The result I like is the one in the red circle. The one above is not what I was looking for.
I only want the ones where the year is in the info text and not in the title.
Just checking what you think about my question to begin with…
If you say there is a chance of making this, I might need some push in the right direction with the code.
Since the website you provided doesn’t seem to have an accessible API that you can query directly, what you are looking for is Web scraping
Web scraping is the process of gathering information by parsing and extracting data from a html page directly.
The steps might be:
Find a DOM parsing library (check this in Python or Node.JS)
Try to understand how the url of the website works when searching a category and a year so you can craft an url to search that
Focus on the data you want to get and how it looks in HTML, if the info text has a specific CSS class or id you can detect it easily and get the innerHTML text
Finally I wouldn’t use Processing for that because you are not going to display or draw anything on the screen and NodeJS or Python are more easy to deal with than Java for this kind of task.
This is my attempt to parse that site using parseXML():
parseXML() can’t parse raw HTML unless we remove lotsa syntax stuff it doesn’t accept.
After filling up my function preXmlParseFixes() w/ lotsa replace() & replaceAll(), parseXML() was finally able to successfully parse what was left of it.
It’s not perfect but I hope it’s a pretty start for you:
/**
* XML Site Parser/Scraper (v1.1.1)
* GoToLoop (2021/Dec/11)
*
* https://Discourse.Processing.org/t/
* can-i-use-processing-to-search-a-site-for-very-specific-results/34052/4
*
* https://Bokborsen.se?_p=1&c=0&f=1&g=65&q=1920
*
* body div#contentwrapper div#content-row div#main-content ul.list-books
* li.book div.content div.content-primary div.header p/h2
*/
static final String
HTTP = "https://",
SITE = "Bokborsen.se",
QUERY = "?q=1920", // 1920
SECTION = "&f=1", // Böcker
CATEGORY = "&g=65", // Teknik
FORMAT = "&c=0", // Alla Böcker
PAGE = "&_p=1", // Page 1
URL = HTTP + SITE + QUERY + SECTION + CATEGORY + FORMAT + PAGE,
YEAR = 'y' + nf(int(QUERY.substring(3)), 4), // y1920
NUM = 'p' + nf(int(PAGE.substring(4)), 4), // p0001
INFO = YEAR + '.' + NUM, // y1920.p0001
PARSABLE = "parsable.",
ENTRIES = "entries.",
EXT = ".xml",
LINKS = "links.",
EXT2= ".txt",
OPEN_TAG = "<ul class=\"list list-books\"",
CLOSE_TAG = "</ul>";
void setup() {
println(URL);
final String lines = join(loadStrings(URL), ENTER);
final int
start = lines.indexOf(OPEN_TAG),
end = lines.indexOf(CLOSE_TAG, start) + CLOSE_TAG.length();
println(lines.length(), start, end, end - start);
final String txt = preXmlParseFixes(lines.substring(start, end));
saveStrings(dataPath(PARSABLE + INFO + EXT), split(txt, ENTER));
final XML ul = parseXML(txt), liTags[] = ul.getChildren("li");
final String query = QUERY.substring(3);
deleteAllChildren(ul);
for (final XML li : liTags) if (
xmlContainsQueryInTag(li, query, "p") &&
!xmlContainsQueryInTag(li, query, "h2")) ul.addChild(li);
print("Found", ul.getChildCount(), "entries out of", liTags.length);
println(" using query '" + query + "'.");
saveXML(ul, dataPath(ENTRIES + INFO + EXT));
saveStrings(dataPath(LINKS + INFO + EXT2), getEntryLinks(ul));
exit();
}
static final boolean xmlContainsQueryInTag(
final XML xml, final String qry, final String tag)
{
final XML // li.book div.content div.content-primary div.header {tag}
divContent = xml.getChild("div"),
divPrimary = findXmlChildWithClassName(divContent, "content-primary"),
divHeader = divPrimary.getChild("div"),
child = divHeader.getChild(tag);
deleteAllChildren(divPrimary).addChild(divHeader);
deleteAllChildren(divContent).addChild(divPrimary);
deleteAllChildren(xml).addChild(divContent);
return child.getContent().contains(qry);
}
static final String[] getEntryLinks(final XML xml) {
final StringList links = new StringList();
for (final XML li : xml.getChildren("li")) {
final XML h2a = getChildDivHeader(li).getChild("h2/a");
final String href = h2a.getString("href"); // h2 a[href]
links.append(HTTP + SITE + href);
}
return links.array();
}
static final XML getChildDivHeader(final XML xml) {
return // li.book div.content div.content-primary
findXmlChildWithClassName(xml.getChild("div"), "content-primary")
.getChild("div"); // div.header
}
static final XML findXmlChildWithClassName(final XML xml, final String name) {
for (final XML child : xml.getChildren())
if (child.getString("class", "").contains(name)) return child;
return null;
}
static final XML deleteAllChildren(final XML xml) {
for (final XML child : xml.getChildren()) xml.removeChild(child);
return xml;
}
static final String preXmlParseFixes(final String xmlTxt) {
return xmlTxt
.replace("Ä", "Ä")
.replace("ä", "ä")
.replace("Å", "Å")
.replace("å", "å")
.replace("é", "é")
.replace("Ö", "Ö")
.replace("ö", "ö")
.replace("Ü", "Ü")
.replace("ü", "ü")
.replace("”", "”")
.replace("…", "\u2026")
.replace(" itemscope", "")
.replace(" data-nav-store-panel", "")
.replace(" data-show-large-product-image", "")
.replace(" data-container-description", "")
.replace(" data-control-expand-description", "")
.replace(" data-control-filter-author", "")
.replace(" data-control-clickcard", "")
.replaceAll(" data-clickcard[-]?\\d*", "")
.replace("<span-expanded class=\"hidden\">", "")
.replace("</nav>", "")
.replace("&qs=", "")
.replaceAll("<[/]?nobr>", "")
.replaceAll("\\s{24}</span>\\n\\s{44}</span>", "</span>");
}
I tried (to try) all 3 examples above with the following results:
@hotfooted’s example using jsoup is the one I got working the best right away.
I downloaded the small jsoup file and put in the program folder and all seems to work.
However, I am wondering about what “pageID” stands for?
The links shows up after the headline “pageID: 2” in the console.
Then PageID 3 and 4…
After pageID: 5 no more links show up and the program seems to just continue. Is this the result pages from "bokborsen.se"
@GoToLoop’s program gives me a teaser in the console with the link of the search.
Then a line with numbers I don’t get.
Finally a line telling me how many book were found.
At first I didn’t get it but then noticed that the code exported .xml files. What would you say is the best way of viewing those xml files? In excel I see all the data but its almost to much. In word I only get text info, no links or way of viewing images.
Really nice getting a file however!
@josephh’s program using Node.JS I never got to work?
I downloaded Node.JS, followed installations instructions and pasted the code in Visual Studio Code.
I selected Run and “Start debugging” and used Node.JS.
Result is: "Uncaught Error: Cannot find module ‘axios’ "
I haven’t used Visual much, mostly for editing 3D printing firmware so I guess I’m missing a lot of basic addons maybe? What’s the next step getting the program running? would love to try it out and getting the result you show in your post!
This is because you missed the install part. Node.JS uses NPM which is a package manager and you need to explicitly install your dependencies (in our case axios and jsdom).
Do that in your project directory (on a terminal, you can open one in VSCode):
$ npm init -y # Initialize your project with npm
$ npm install axios jsdom # Install the dependencies
This is the println() code statement: println(lines.length(), start, end, end - start);
Variable lines is the String containing the loaded webpage: final String lines = join(loadStrings(URL), ENTER);
Method length() gets the total number of characters of the String.
Variable start holds the String’s index where this substring is found: OPEN_TAG = "<ul class=\"list list-books\"" start = lines.indexOf(OPEN_TAG)
And variable end holds the last index for it: CLOSE_TAG = "</ul>" end = lines.indexOf(CLOSE_TAG, start) + CLOSE_TAG.length()
The tag <ul> stores an unordered list of <li> tags:
The webpage contains more than 1 <ul> tags though.
But the 1 containing the <li> tags for the books belongs to classes “list” & “list-books”.
That’s why the search substring is <ul class="list list-books".
The sketch extracts that <ul> opening tag until its </ul> closing tag and discards the rest using the method String::indexOf() to find both the start & end indices then String::substring() to cut off that section outta the webpage String: final String txt = preXmlParseFixes(lines.substring(start, end));
The current total number of characters after that is end - start.
Finally the sketch saves that <ul></ul> tag section as an XML file: “parsable.y1920.p0001.xml” saveStrings(dataPath(PARSABLE + INFO + EXT), split(txt, ENTER));
I’m just using a browser for it. Even Notepad can open an XML file.
After the 1st saveStrings() I use that same txt variable as the argument for parseXML() in order to get an XML object from it: final XML ul = parseXML(txt), liTags[] = ul.getChildren("li");
While the String “1920” is searched within that XML object I ended up deleting all the other children tags using my function deleteAllChildren() throughout the code:
So I’ve extended the code w/ 2 more functions: getEntryLinks() & getChildDivHeader().
I use those 2 new functions to capture all the “href” links inside the tags <h2><a href>: saveStrings(dataPath(LINKS + INFO + EXT2), getEntryLinks(ul));
I’ve edited my previous post to include the latest “XML Site Parser (v1.1.0)” version btW.
@GoToLoop, Thanks for the explanation.
One thing I cent seem to get to work is to edit the search word “1920” to any other year? Im getting fatal error!
@josephh, I have tried installing with the lines you wrote above and there is something happening but I still cant run your program… Guess I’m spoiled with Processing being easy to use and easy to add files for beginners compared to visual code.