## Making an Atom Feed using Selenium

Making an Atom Feed using Selenium.

by Christoph Lohmann <20h@r-36.net>

## Intro

* The web is getting more complex.
* There are pure-javascript, framework-auto-generated websites with
  nothing left to parse in the plain HTML.
* The only way to get any content out of them is to execute the
  javascript.
* Sadly, that requires parts of a browser.
* A standalone javascript engine is not enough.
* Everything is intertwingled.
* Google wanted it that way.

## Basic Atom Feed Generation

{
	printf '<?xml version="1.0" encoding="UTF-8"?>\n'
	printf '<feed xmlns="http://www.w3.org/2005/Atom">\n'
	printf '<updated>%s</updated>\n' "$(date "+%FT%T%z")"
	hurl "$uri" \
		| grep content | sed 's,rawcontent,content,g' \
		| while read -r line; do
			printf "<entry>"
			printf "<content>%s</content>" "${line}"
			printf "</entry>\n"
	done
	printf "</feed>\n"
} > somefeed.atom

## How it evolved.

* Frameworks like python requests.
* Webkit.
* Small browsers evolved for scraping.
* PhantomJS.
  --> They became outdated due to the speed at which web engines evolved.
  --> Feature bloat.
  --> Corporate need for new things, besides there being no need for them.
  --> Sell more products.
* Intermediate steps of complex control protocols followed.
* I will skip them so you stay sane.

## Current State: WebDriver

https://w3c.github.io/webdriver/

> WebDriver is a remote control interface that enables introspection and
> control of user agents. It provides a platform- and language-neutral
> wire protocol as a way for out-of-process programs to remotely instruct
> the behavior of web browsers.

Web browsers expose HTTP endpoints:

	POST /session/...
	DELETE /session/...
	GET /session/...

* Could be wrapped into C too.
* For fast prototyping we use selenium and python.

## Selenium Environment

1. Get Selenium

	$ pip install selenium
	# Huge bloat is installed.

2.
Get a Chromium WebDriver

Normally included in your chromium installation at:

	/usr/bin/chromedriver

Or:

	Gentoo: emerge www-apps/chromedriver-bin
	Binary package: https://chromedriver.chromium.org/downloads

## Selenium Environment

Other Web Browsers:

* Edge
* Firefox
* Internet Explorer
* Safari

All have their quirks:

	https://www.selenium.dev/documentation/webdriver/browsers/

## Basic Selenium Script

#!/usr/bin/env python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.bitreich.org")
driver.implicitly_wait(1.0)
driver.find_element(By.XPATH, "//*[@class=\"proletariat\"]").text

Output:

	gophers://bitreich.org

## Selenium IDE

https://www.selenium.dev/selenium-ide/

Browser extension for Firefox and Chromium to record interactions with
websites.

* Easily generate scripts from that.

## Select Content in Websites

driver.find_element or driver.find_elements
	<p>
		<elem id="text" class="info" meta="...">
			text to grep
		</elem>
	</p>

e = driver.find_element(By.ID, "text").text
e = driver.find_element(By.TAG_NAME, "elem").text
e = driver.find_element(By.CLASS_NAME, "info").text
e = driver.find_element(By.XPATH, "//p/elem").text

Others: By.NAME (forms), By.CSS_SELECTOR, By.LINK_TEXT,
By.PARTIAL_LINK_TEXT

e.get_attribute("meta")

## Stuff we won't handle here.

Selenium can do:

* input
* fill out forms
* emulate key presses
* upload files
* send forms
* scroll web pages
* do pen actions (tablet)
* emulate the mouse
* drag and drop elements
* navigate around in the browser history

## Stuff we won't handle here.

Selenium can do:

* window manipulation
* handle multiple tabs / windows
* handle iframes
* move windows around
* take screenshots
* print websites
* handle popup alerts
* set / get cookies
* let you run inline javascript
* do color animations
* debug javascript for you using the bidirectional protocol
* build huge action chains for time-perfect handling

## Complex Example

https://www.kvsachsen.de/

* The 'new modern' website of my doctors' association.
* All in a javascript framework.
* News is hidden behind loading even more javascript.
* No RSS feed.

News:

1. Open https://www.kvsachsen.de/fuer-praxen/aktuelle-informationen/praxis-news
2. Parse the javascript in a subframe.
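The selection strategies shown earlier can be tried offline with just the
Python stdlib, without a running browser. This is a rough sketch only: the
markup, tag names, and attribute values are made up for illustration, and
`xml.etree.ElementTree` supports only a small XPath subset, not the full
XPath selenium offers:

```python
# Offline sketch of element selection using only the Python stdlib.
# The sample markup mirrors the hypothetical <elem> snippet from the
# selection slide; it is not a real website.
import xml.etree.ElementTree as ET

page = """<p>
	<elem id="text" class="info" meta="some value">text to grep</elem>
</p>"""

root = ET.fromstring(page)

# Rough equivalents of the By.* selection strategies:
by_tag = root.find(".//elem").text                    # like By.TAG_NAME
by_class = root.find(".//elem[@class='info']").text   # like By.CLASS_NAME
by_id = root.find(".//elem[@id='text']").text         # like By.ID
by_xpath = root.find("./elem").text                   # like By.XPATH "//p/elem"

print(by_tag)                               # -> text to grep
print(root.find(".//elem").get("meta"))     # -> some value
```

Handy for checking a selector idea against a saved HTML fragment before
pointing selenium at the live page.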
## Complex Example

Get stuff ready:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as chromeoptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from datetime import datetime
import pytz

link = "https://www.kvsachsen.de/fuer-praxen/\
aktuelle-informationen/praxis-news"

## Complex Example

Get ChromeDriver ready:

options = chromeoptions()
chromearguments = [
	"headless",
	"no-sandbox",
	"disable-extensions",
	"disable-dev-shm-usage",
	"start-maximized",
	"window-size=1900,1080",
	"disable-gpu"
]
for carg in chromearguments:
	options.add_argument(carg)
driver = webdriver.Chrome(options=options)

## Complex Example

Get the content:

driver.get(link)

## Complex Example

Wait for the content to be ready and loaded, with a timeout of 60
seconds:

isnews = WebDriverWait(driver=driver, timeout=60).until(
	EC.presence_of_element_located((By.XPATH,
		"//div[@data-last-letter]")
	)
)

EC ... Expected Condition

EC can be very many things:

https://www.selenium.dev/selenium/docs/api/py/\
webdriver_support/\
selenium.webdriver.support.expected_conditions.html

Pro Tip: Do not wait for a static time; use some EC for this. You will
be safer and have fewer errors.
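Under the hood, WebDriverWait is just polling. A conceptual, stdlib-only
sketch of the idea (the function name wait_until is my own, not a selenium
API):

```python
import time

def wait_until(condition, timeout=60.0, poll=0.5):
	"""Conceptual sketch of WebDriverWait(...).until(...):
	call condition() repeatedly until it returns something truthy,
	or raise once the timeout is exceeded. Illustration only, not
	the selenium implementation."""
	end = time.monotonic() + timeout
	while True:
		value = condition()
		if value:
			# WebDriverWait also hands back the condition's
			# value, e.g. the located element.
			return value
		if time.monotonic() >= end:
			raise TimeoutError(
				"condition not met within %gs" % timeout)
		time.sleep(poll)

# Usage sketch: instead of time.sleep(5) and hoping, poll explicitly.
print(wait_until(lambda: "element", timeout=1.0))  # -> element
```

This is why an EC beats a static sleep: it returns as soon as the page is
ready and only errors out when the page really never gets there.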
## Complex Example

Get the root news element we work from:

newslist = driver.find_elements(By.XPATH,
	"//div[@data-filter-target=\"list\"]")[0]

Get some metadata for the atom feed:

title = driver.find_elements(By.XPATH,
	"//meta[@property=\"og:title\"]")[0].\
	get_attribute("content")
description = title

## Complex Example

Print the header of the atom feed to stdout:

print("""<?xml version="1.0" encoding="UTF-8"?>""")
print("""<feed xmlns="http://www.w3.org/2005/Atom">""")
print("\t<title><![CDATA[%s]]></title>" % (title))
print("\t<subtitle><![CDATA[%s]]></subtitle>" % (description))
print("\t<id>%s</id>" % (link))
print("\t<link href=\"%s\"/>" % (link))
print("\t<link rel=\"self\" href=\"%s\"/>" % (link))

Use the current date for updated:

utcnow = datetime.now(pytz.utc)
print("\t<updated>%s</updated>" % (utcnow.isoformat()))

## Complex Example

Get the entries:

articles = newslist.find_elements(By.XPATH, "./div")

Prepare a base URI for appending relative links:

baselink = "/".join(link.split("/", 3)[:-1])

Loop over all entries in reverse order:

for article in articles[::-1]:

## Complex Example

Find the deep link to the article:

	link = article.find_elements(By.XPATH, "./a")[0]
	plink = link.get_attribute("href")

Normalize the link in case it is relative:

	if not plink.startswith("http"):
		plink = "%s/%s" % (baselink, plink)

Get the entry title and content, and set a static author:

	ptitle = link.get_attribute("data-title")
	pcontent = article.text
	pauthor = "sachsen@kvsachsen.de"

## Complex Example

Parse the datetime of the article release:

	updateds = article.find_elements(By.XPATH, ".//time")[0].text
	try:
		dtupdated = datetime.strptime(updateds, "%d.%m.%Y")
	except ValueError:
		continue

Bring the datetime into python-native format for further processing:

	dtupdated = dtupdated.replace(hour=12, minute=0,
		second=0, tzinfo=pytz.utc)
	if dtupdated.year > utcnow.year:
		dtupdated = dtupdated.replace(year=utcnow.year)
	pupdated = dtupdated

## Complex Example

Print the entry:

	print("\t<entry>")
	print("\t\t<id>%s</id>" % (plink))
	print("\t\t<title><![CDATA[%s]]></title>" % (ptitle))
	print("\t\t<link href=\"%s\"/>" % (plink))
	print("\t\t<author><email>%s</email></author>" % (pauthor))
	print("\t\t<updated>%s</updated>" % (pupdated.isoformat()))
	print("\t\t<content><![CDATA[%s]]></content>" % (pcontent))
	print("\t</entry>")

Print
the footer (outside the entries loop):

print("</feed>")

## Example Script

The full example script and how I use it can be found in:

	git://bitreich.org/brcon2023-hackathons
	./sfeed-atom/kvssachsen2atom

## Summary

* With Selenium you can script all of the modern web in all ways.
* We fight bloat with bloat.
* You run a full web browser process just to parse the web.
* We wanted to avoid that with scraping.
* You can easily prototype web access in, for example, ipython(1).
* There are still privacy concerns.
* You run a huge blob of hundreds of thousands of lines of code.
* Plato's cave allegory.

## Plato's cave allegory

[ASCII art: people in a cave watching shadows on the wall; outside,
the sun: "too bright!"]

* People are in a cave, watching the shadow figure of a hash, presented
  to them by a narrator behind a wall, near the exit of the cave.
* When people want to leave the cave, they are blinded by the sun. The
  sunlight hurts their eyes. They go back into the cave.
* The outside does not look as finely presented and prepared as the
  shadows of the narrator's play. The shadows do not hurt the eyes.
* Only some people are able to adapt their eyes and see the beauty of
  not being dependent on a narrator. They will be able to leave the
  cave.

## Questions?

Do you have questions?

## Thanks

Thank you for listening.

For further suggestions, contact me at:

	Christoph Lohmann <20h@r-36.net>