Alongside my daily projects I am working on a new pet project based on XPages.
Maybe more about this project later.
One of the requests, more of a nice-to-have, was to import lots of data into the new application to avoid entering it manually. So I searched on Google and found JSoup, with some nice examples and tutorials.
What is JSoup
With JSoup you can parse and manipulate HTML inside your Java code. The website has a cookbook where you can find lots of tutorials.
To get elements out of the HTML you can use CSS or jQuery-like selector syntax, which is very easy to use.
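To give an idea of the selector syntax, here is a minimal sketch. The class name, the helper method and the sample HTML are made up for illustration; only `Jsoup.parse`, `select` and `text` come from the JSoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of JSoup itself.
class SelectorSketch {

    // Returns the text of every element that matches the CSS selector.
    static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssQuery);
        List<String> texts = new ArrayList<String>();
        for (Element match : matches) {
            texts.add(match.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample HTML, invented for this example.
        String html = "<div id='news'><p class='title'>First</p><p>skip</p>"
                + "<p class='title'>Second</p></div>";
        // Selects only the <p class="title"> elements inside <div id="news">.
        System.out.println(textsOf(html, "div#news p.title")); // prints [First, Second]
    }
}
```

The same query string would work in a browser with `document.querySelectorAll`, which is what makes the syntax feel familiar.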
One of the nicest things is the online interactive demo.
Setup JSoup
The setup is pretty straightforward if you have imported jar files into your database before.
- Download the jar file
- In the Package Explorer, import the jar file into the WebContent/WEB-INF/lib directory
- Select the jar file and choose Build Path -> Add to Build Path
Use JSoup
After the setup you can use JSoup inside your Java code, which for me is the most natural place as my application uses the MVC principle.
In my case I have a non-secured start page with lots of links (1504 useful ones, as it turned out) to web pages secured by a login.
So I started by collecting the links on the non-secured website. All the links I need are inside a table, so I first get the table and then query the table for the links. This skips lots of unwanted links, such as the CSS or JavaScript links.
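Roughly, collecting the links from a table could look like the sketch below. It is not the original code from this project: the class name, the `table#links` selector and the example.com URLs are assumptions for illustration. It parses an HTML string so it runs without a network connection; against a live site you would get the `Document` from `Jsoup.connect(startUrl).get()` instead, which sets the base URI for you.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; selector and URLs are placeholders.
class TableLinkSketch {

    // Finds the first table matching tableSelector and returns the absolute
    // URLs of all links inside it. Links outside the table are ignored.
    static List<String> linksInTable(String html, String baseUri, String tableSelector) {
        Document doc = Jsoup.parse(html, baseUri);
        Element table = doc.select(tableSelector).first();
        List<String> urls = new ArrayList<String>();
        if (table == null) {
            return urls; // no such table on the page
        }
        for (Element link : table.select("a[href]")) {
            // "abs:href" resolves relative links against the base URI.
            urls.add(link.attr("abs:href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><head><link rel='stylesheet' href='style.css'></head><body>"
                + "<a href='/about.html'>About</a>"
                + "<table id='links'>"
                + "<tr><td><a href='/docs/1.html'>One</a></td></tr>"
                + "<tr><td><a href='docs/2.html'>Two</a></td></tr>"
                + "</table></body></html>";
        // Only the two links inside the table are printed.
        for (String url : linksInTable(html, "http://example.com/start/", "table#links")) {
            System.out.println(url);
        }
    }
}
```

Querying the table element rather than the whole document is what keeps the stylesheet link and the "About" link out of the result.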
The next step was to loop through all the collected URLs, but first I needed to hijack the login.
For that I needed the HTTP headers and the form data of the login request, explained in more detail here.
Once I had my login URL, my credentials and the form data that is submitted to this login, I created a small method to get my session cookie.
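A method like that could be sketched as follows. This is an illustration under assumptions, not the original method: the class name and the form field names `username` and `password` are placeholders, and the real names have to be taken from the form data your own login page submits.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;
import java.util.Map;

// Hypothetical helper class; field names are placeholders.
class LoginSketch {

    // Builds the login POST request without sending it yet.
    static Connection loginRequest(String loginUrl, String user, String password) {
        return Jsoup.connect(loginUrl)
                .data("username", user)      // assumed field name, check your form
                .data("password", password)  // assumed field name, check your form
                .method(Connection.Method.POST);
    }

    // Sends the login request and returns the cookies set by the server,
    // which normally include the session cookie.
    static Map<String, String> fetchSessionCookies(String loginUrl, String user, String password)
            throws IOException {
        Connection.Response response = loginRequest(loginUrl, user, password).execute();
        return response.cookies();
    }

    public static void main(String[] args) {
        // Only builds the request; nothing is sent over the network here.
        Connection request = loginRequest("https://example.com/login", "user", "secret");
        System.out.println(request.request().method());
    }
}
```

Some login forms also expect hidden fields or specific headers; those can be added with further `data(...)` and `header(...)` calls on the same connection.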
JSoup can then use this cookie to connect to the secured website successfully.
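As a sketch of this step, assuming the cookies are available as a `Map<String, String>`: the class name, the cookie name `SessionID`, the 10 second timeout and the example.com URL are illustrative choices, not values from the real site.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; cookie name and timeout are placeholders.
class SecuredPageSketch {

    // Builds a request that carries the session cookies; nothing is sent yet.
    static Connection securedRequest(String url, Map<String, String> sessionCookies) {
        return Jsoup.connect(url)
                .cookies(sessionCookies)
                .timeout(10000); // milliseconds, an arbitrary choice
    }

    // Loads every URL with the same session cookies and returns the parsed pages.
    static List<Document> fetchAll(List<String> urls, Map<String, String> sessionCookies)
            throws IOException {
        List<Document> pages = new ArrayList<Document>();
        for (String url : urls) {
            pages.add(securedRequest(url, sessionCookies).get());
        }
        return pages;
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new HashMap<String, String>();
        cookies.put("SessionID", "abc123"); // made-up cookie for the example
        // Only builds the request; call get() on it to actually load the page.
        Connection request = securedRequest("https://example.com/secure/page1.html", cookies);
        System.out.println(request.request().cookies());
    }
}
```

For a large number of pages it may be worth catching the `IOException` per URL, so that one failing page does not stop the whole import.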
Once the HTML of the secured page was loaded, I could start collecting the required data from it.
Conclusion
JSoup is a very easy to use Java library with a comprehensive API and lots of examples and tutorials. The selector query syntax is especially powerful.
The above example resulted in 1504 useful links, which were harvested for the required data, resulting in 20,488 documents in the database.