Alongside my daily projects I am working on a new pet project based on XPages.
Maybe more about this project later 😉
One of the requests, more of a nice-to-have, was to import lots of data into the new application to avoid entering it manually. So I searched on Google and found JSoup, with some nice examples and tutorials.
What is JSoup
With JSoup you can parse and manipulate HTML from within your Java code. The website has a cookbook where you can find lots of tutorials.
To get elements out of the HTML you can use CSS or jQuery-like selector syntax, which is very easy to use.
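As a small illustration of the selector syntax, here is a minimal sketch that parses an HTML string and picks elements with a CSS-style selector. The class name SelectorDemo, the HTML snippet and the selector "#menu li" are made up for this example; Jsoup.parse, select and text are regular JSoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {

    // Returns the text of every <li> inside the element with id "menu".
    // The id and structure are assumptions for this example only.
    public static List<String> menuItems(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<String>();
        for (Element item : doc.select("#menu li")) {
            result.add(item.text());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div id='menu'><ul><li>Home</li><li>News</li></ul></div>";
        System.out.println(menuItems(html));
    }
}
```

The same select call also accepts attribute and class selectors such as "a[href]" or "div.content p", so most of what you know from CSS or jQuery carries over.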
One of the nicest things is the online interactive demo.
The setup is pretty straightforward if you have imported jar files into your database before.
- Download the jar file
- In the Package Explorer, import the jar file into the WebContent/WEB-INF/lib directory
- Select the jar file and choose Build Path –> Add to Build Path
After the setup you can use JSoup inside your Java code, which for me is the most natural place as my application uses the MVC principle.
In my case I have a non-secure start page with lots of links, 1504 useful ones as it turned out, pointing to web pages secured by a login.
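Collecting the links from such a start page could look roughly like the sketch below. It is not the exact code of my application: the class name LinkCollector and the plain "a[href]" selector are assumptions, and in practice you would narrow the selector down to the links that are actually useful.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkCollector {

    // Collects the absolute URLs of all links in a parsed page,
    // dropping duplicates while keeping the original order.
    public static List<String> extractLinks(Document doc) {
        Set<String> links = new LinkedHashSet<String>();
        for (Element a : doc.select("a[href]")) {
            // "abs:href" resolves relative links against the document's base URI
            String url = a.attr("abs:href");
            if (url.length() > 0) {
                links.add(url);
            }
        }
        return new ArrayList<String>(links);
    }

    // Fetches the (non-secure) start page and returns its links.
    public static List<String> collectLinks(String startUrl) throws IOException {
        Document doc = Jsoup.connect(startUrl).get();
        return extractLinks(doc);
    }
}
```

Keeping the parsing in extractLinks separate from the HTTP call makes it possible to try the selector on a saved copy of the page first.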
The next step was to loop through all the collected URLs, but first I needed to hijack the login.
First I needed the HTTP headers and form data of the login request; this is explained in more detail here.
Once I had my login URL, my credentials and the form data that is submitted to this login, I created a small method to get my session cookie.
JSoup can then use this cookie to connect to the secured website successfully.
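A minimal sketch of such a login method is shown below. The class name SessionLogin and the form field names "username" and "password" are assumptions; the real names depend on the login form of the site and have to be taken from the form data you captured.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.Map;

public class SessionLogin {

    // Prepares the login POST without sending it yet.
    // The field names are placeholders, use the ones from the captured form data.
    public static Connection buildLogin(String loginUrl, String user, String password) {
        return Jsoup.connect(loginUrl)
                .data("username", user)
                .data("password", password)
                .method(Connection.Method.POST);
    }

    // Submits the login form and returns the cookies set by the server,
    // which should include the session cookie.
    public static Map<String, String> login(String loginUrl, String user, String password)
            throws IOException {
        Connection.Response response = buildLogin(loginUrl, user, password).execute();
        return response.cookies();
    }

    // Loads a secured page by sending the session cookies along with the request.
    public static Document fetchSecured(String url, Map<String, String> cookies)
            throws IOException {
        return Jsoup.connect(url).cookies(cookies).get();
    }
}
```

Depending on the site, the login form may expect extra hidden fields or a redirect target; those can be added with further data(...) calls.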
Once the HTML of the secured page was loaded I could start collecting the required data from it.
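What the collecting looks like depends entirely on the markup of the page. As a hedged example, assume the data sits in a table with the class "data"; the class name PageHarvester, the selector and the table layout are all hypothetical.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class PageHarvester {

    // Reads every data row of the (assumed) table.data into a list of cell texts.
    public static List<List<String>> extractRows(Document doc) {
        List<List<String>> rows = new ArrayList<List<String>>();
        for (Element tr : doc.select("table.data tr")) {
            Elements cells = tr.select("td");
            if (cells.isEmpty()) {
                continue; // header rows only contain <th> cells
            }
            List<String> row = new ArrayList<String>();
            for (Element td : cells) {
                row.add(td.text());
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Each returned row can then be mapped to a document in the database, one field per cell.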
JSoup is a very easy to use Java library with a comprehensive API and lots of examples and tutorials. The selector query syntax is especially powerful.
The above example resulted in 1504 useful links, which were harvested for the required data, resulting in 20,488 documents in the database.