elstar IT

Fullstack | Java | Tech Speaker | Tech Coach | Frank van der Linden

My take on JSoup

02-06-2015 no responses flinden68 development

Alongside my daily projects I am working on a new pet project, based on XPages.

Maybe more about this project later 😉

One of the requests, more of a nice-to-have, was to import lots of data into the new application, to avoid entering it manually. So I searched on Google and found JSoup, with some nice examples and tutorials.

What is JSoup

With JSoup you can parse and manipulate HTML inside your Java code. The website has a cookbook with lots of tutorials.

To get elements out of the HTML you can use CSS or jQuery-like selector syntax, which is very easy to use.
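
To give an idea of the selector syntax, here is a minimal sketch. The HTML snippet, the class name SelectorDemo and the method navLabels are made up for illustration only; they are not taken from my application.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SelectorDemo {

    // Made-up HTML, used only to illustrate the selector syntax.
    static final String HTML =
          "<div id='menu'>"
        + "<a class='nav' href='/home'>Home</a>"
        + "<a class='nav' href='/blog'>Blog</a>"
        + "<a href='/login'>Login</a>"
        + "</div>";

    // Returns the text of every anchor with class "nav"
    // inside the element with id "menu".
    static List<String> navLabels(String html) {
        Document doc = Jsoup.parse(html);
        List<String> labels = new ArrayList<String>();
        for (Element link : doc.select("#menu a.nav")) {
            labels.add(link.text());
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(navLabels(HTML));
    }
}
```

The selector "#menu a.nav" reads just like it would in CSS or jQuery: anchors with class nav somewhere below the element with id menu. The Login link has no nav class, so it is skipped.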

One of the nicest things is the online interactive demo.

[Image: JSoupOnlineDemo]

Setup JSoup

The setup is pretty straightforward if you have imported jar files into your database before.

  • Download the jar file
  • In the Package Explorer, import the jar file into the WebContent/WEB-INF/lib directory
  • Select the jar file and choose Build Path –> Add to Build Path

Use JSoup

After the setup you can use JSoup inside your Java code, which for me is the most natural place, as my application follows the MVC principle.

In my case I have a non-secured start page with lots of links (1504 useful ones, as it turned out) to web pages secured by a login.

So I started by collecting the links on the non-secured website. All the links I need are inside a table, so I first get the table and then query it for the links. This skips lots of unwanted links, such as the CSS or JavaScript links.

[Image: JSoup - getAllLinks]
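
A minimal sketch of this step could look like the following. It is an illustration under assumptions, not the exact code of my application: the class name LinkCollector and the tableSelector parameter are hypothetical, and the real selector depends on the markup of the start page.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class LinkCollector {

    // Collects the absolute URLs of all links inside the first table
    // that matches tableSelector. Links outside that table are ignored.
    static List<String> getAllLinks(Document doc, String tableSelector) {
        List<String> urls = new ArrayList<String>();
        Element table = doc.select(tableSelector).first();
        if (table == null) {
            return urls; // no such table on the page
        }
        for (Element link : table.select("a[href]")) {
            // absUrl resolves relative hrefs against the document's base URI
            urls.add(link.absUrl("href"));
        }
        return urls;
    }
}
```

For a live page the Document would come from something like Jsoup.connect(startUrl).get(), where startUrl is the address of the non-secured start page. Querying the table first, instead of the whole document, is what keeps the stylesheet and script links out of the result.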

 

The next step was to loop through all the collected URLs, but first I needed to hijack the login.

For that I needed the HTTP headers and form data of the login request, explained in more detail here.

Once I had the login URL, my credentials, and the form data that is submitted to the login, I created a small method to get my session cookie.

[Image: JSoup - getSessionId]
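
Such a method could be sketched roughly as below. The form field names "username" and "password" and the cookieName parameter are assumptions; the real names have to be taken from the form data that the browser submits to the login URL.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

final class SessionLogin {

    // Builds the login POST request. The field names "username" and
    // "password" are placeholders for the real form field names.
    static Connection buildLoginRequest(String loginUrl, String user, String password) {
        return Jsoup.connect(loginUrl)
                .data("username", user)
                .data("password", password)
                .method(Connection.Method.POST);
    }

    // Executes the login and returns the value of the session cookie,
    // or null when the server did not set a cookie with that name.
    static String getSessionId(String loginUrl, String user, String password,
                               String cookieName) throws IOException {
        Connection.Response response = buildLoginRequest(loginUrl, user, password).execute();
        return response.cookie(cookieName);
    }
}
```

Splitting the request building from the execution is only a choice for this sketch; it makes it possible to inspect the request without touching the network.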

 

JSoup can then use this cookie to connect successfully to the secured website.

[Image: JSoup - loop through links]
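
The loop could be sketched as follows, assuming the session cookies are available as a map of cookie names to values. The class name SecuredPageLoader and both method names are hypothetical, and printing the page title only stands in for the real processing.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class SecuredPageLoader {

    // Prepares a request that sends the session cookies along,
    // so the secured site treats it as logged in.
    static Connection withSession(String url, Map<String, String> cookies) {
        return Jsoup.connect(url).cookies(cookies);
    }

    // Loads every collected URL with the session cookies and hands
    // the parsed page over for further processing.
    static void harvest(List<String> urls, Map<String, String> cookies) throws IOException {
        for (String url : urls) {
            Document page = withSession(url, cookies).get();
            System.out.println(url + " -> " + page.title());
        }
    }
}
```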

 

Once the HTML of the secured page was loaded, I could start collecting the required data from it.
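
What the collecting looks like depends entirely on the markup of the page. Purely as an illustration, assume a page that shows its data as table rows with a label cell and a value cell; that layout, the class name PageDataExtractor and the method extractFields are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class PageDataExtractor {

    // Reads every table row with exactly two cells as a label/value
    // pair and keeps the pairs in the order they appear on the page.
    static Map<String, String> extractFields(Document page) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (Element row : page.select("table tr")) {
            Elements cells = row.select("td");
            if (cells.size() == 2) {
                fields.put(cells.get(0).text(), cells.get(1).text());
            }
        }
        return fields;
    }
}
```

A map like this can then be written to a document in the database, one field per entry.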

 

Conclusion

JSoup is a very easy-to-use Java library with a comprehensive API and lots of examples and tutorials. The selector query syntax is especially powerful.

The above example resulted in 1504 useful links, which were harvested for the required data, resulting in 20,488 documents in the database.

Tags: development, java, jsoup, xpages

Contact me

My name is Frank van der Linden and I am an independent software developer based in the Netherlands. For the last two years I have been named an IBM Champion. I am also on the board of OpenNTF. My specialisations are Java, web development and Domino.


All the code on this blog is licensed under the Apache License 2.0. For more details, see Apache License 2.0
