Comments and answers for "Scraping dynamic web page data"

Answer by dbaldacchino1

dbaldacchino1 — Sat, 13 Oct 2018 00:01:59 GMT

OK, so I was able to get to the data without additional tools. Lucky? Maybe :) It took some sleuthing, but here's how I got to what I needed for the example above:

I saved the web page, choosing the "Webpage, Complete" option that I had only just noticed in Chrome.
I opened the saved file in Notepad++ and read through it to see if I could find anything helpful. I noticed that the tag I wanted was "h2", but most importantly, I noticed this around the middle of the saved page:
I tested by reconstructing the URL as "https://help.autodesk.com/cloudhelp/ENU/CONNECT/files/GUID-03D59AAD-65B0-45E3-84F2-A12AAA5BB267.htm" and the page loaded (the URL redirected to the original one).
In FME I used an HTMLExtractor on this new URL:
And BAAM! As happy as can be :)
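
For anyone who wants to try the same steps outside FME, here is a rough Python sketch using only the standard library. It assumes the static topic URL is always the base path shown above plus the GUID plus ".htm", which I have only confirmed for this one page. The names build_topic_url, H2Collector, extract_h2_texts and fetch_h2_texts are made up for illustration and are not part of FME or any library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Base path taken from the reconstructed URL above. That every topic
# follows this pattern is an assumption, not something documented.
BASE_URL = "https://help.autodesk.com/cloudhelp/ENU/CONNECT/files/"


def build_topic_url(guid):
    """Rebuild the static topic URL from a GUID found in the saved page."""
    return f"{BASE_URL}{guid}.htm"


class H2Collector(HTMLParser):
    """Collects the text content of every <h2> element it sees."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) still arrives here.
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append("".join(self._buffer).strip())
            self._in_h2 = False


def extract_h2_texts(html):
    """Return the text of all <h2> elements in an HTML string."""
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.headings


def fetch_h2_texts(guid):
    """Download the static topic page and return its <h2> texts."""
    with urlopen(build_topic_url(guid)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_h2_texts(html)
```

Calling fetch_h2_texts("GUID-03D59AAD-65B0-45E3-84F2-A12AAA5BB267") should do roughly what the HTMLExtractor does on that URL, provided the page is still served as static HTML.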

I guess when there's a will, there's a way!

Answer by dbaldacchino1

dbaldacchino1 — Fri, 12 Oct 2018 14:19:25 GMT

Thanks a lot @revesz. My Python skills are... uhm... copy & paste mostly :) So I would definitely appreciate you sharing any of your content, and perhaps I can use it to learn more. My goal for this project is to scrape the text to know when a new version of the software is available. I have done this on other sites, but they either have some table exposed that I can get directly in FME, or there is an API available which I was able to figure out from the page source itself.

Answer by revesz

revesz — Fri, 12 Oct 2018 10:18:10 GMT

Unfortunately, rendering this page is a task for a web browser, or at least a rendering engine.

There are Python solutions that call a browser to do the rendering and then get the required information from the rendered DOM.

One of them is the Selenium package. It can even call web browsers in headless mode. A headless Chrome discussion is here as a starting point.

This solution requires you to install the package in FME's Python and to place the Chromedriver executable in a folder that is visible to the Python script.
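
As a rough illustration of the Selenium approach, here is a minimal sketch rather than a finished solution. It assumes the Selenium 4 API (older releases pass the driver path differently), that the text you want sits in "h2" elements, and that you supply the page URL and the Chromedriver path yourself. fetch_rendered_h2 and HEADLESS_FLAGS are illustrative names, not part of Selenium or FME.

```python
# Chrome command-line switches for running without a visible window.
HEADLESS_FLAGS = ["--headless", "--disable-gpu", "--window-size=1280,1024"]


def fetch_rendered_h2(url, chromedriver_path, timeout=15):
    """Render a page in headless Chrome and return the text of its <h2> elements.

    Selenium is imported here, not at module level, so the script still
    loads in a Python environment where the package is not installed yet.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for flag in HEADLESS_FLAGS:
        options.add_argument(flag)

    driver = webdriver.Chrome(
        service=Service(executable_path=chromedriver_path),
        options=options,
    )
    try:
        driver.get(url)
        # Wait until the page's scripts have produced at least one <h2>.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "h2"))
        )
        return [el.text for el in driver.find_elements(By.TAG_NAME, "h2")]
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

The explicit wait is the important part: the headings only exist after the page's JavaScript has run, so reading the DOM straight after driver.get() may return nothing.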

I'm working on a rendering solution, but it is not a high priority, so it may take a week or so. I'm happy to share it once it is reasonably stable.