## Making an Atom Feed using Selenium

Making an Atom Feed using Selenium.

by Christoph Lohmann <20h@r-36.net>

## Intro

* The web is getting more complex.
* There are pure-javascript, framework-auto-generated websites with
  nothing left to parse in the plain HTML.
* The only way to get any content out of them is to execute the
  javascript.
* Sadly, that requires parts of a browser.
* A standalone javascript engine is not enough.
* Everything is intertwingled.
* Google wanted it that way.

## Basic Atom Feed Generation

{
	printf '<?xml version="1.0" encoding="UTF-8"?>\n'
	printf '<feed xmlns="http://www.w3.org/2005/Atom">\n'
	printf '<updated>%s</updated>\n' "$(date "+%FT%T%z")"
	hurl "$uri" \
		| grep content | sed 's,rawcontent,content,g' \
		| while read -r line; do
			printf "<entry>"
			printf "<content>%s</content>" "${line}"
			printf "</entry>\n"
	done
	printf "</feed>\n"
} > somefeed.atom

## How it evolved.

* Frameworks like python requests.
* Webkit.
* Small browsers evolved for scraping.
* PhantomJS.
  --> They became outdated due to the speed at which web engines evolved.
  --> Feature bloat.
  --> Corporate need for new things, besides there being no need for them.
  --> Sell more products.
* Intermediate steps of complex control protocols followed.
* I will skip them so you stay sane.

## Current State: WebDriver

https://w3c.github.io/webdriver/

> WebDriver is a remote control interface that enables introspection and
> control of user agents. It provides a platform- and language-neutral
> wire protocol as a way for out-of-process programs to remotely instruct
> the behavior of web browsers.

Web browsers expose HTTP endpoints:

	POST /session/...
	DELETE /session/...
	GET /session/...

* Could be wrapped into C too.
* For fast prototyping we use selenium and python.

## Selenium Environment

1. Get Selenium

	$ pip install selenium
	# Huge bloat is installed.

2.
Get a Chromium WebDriver

Normally included in your chromium installation at:

	/usr/bin/chromedriver

Or:

	Gentoo: emerge www-apps/chromedriver-bin
	Binary package: https://chromedriver.chromium.org/downloads

## Selenium Environment

Other Web Browsers:

* Edge
* Firefox
* Internet Explorer
* Safari

All have their quirks:

	https://www.selenium.dev/documentation/webdriver/browsers/

## Basic Selenium Script

#!/usr/bin/env python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.bitreich.org")
driver.implicitly_wait(1.0)
driver.find_element(By.XPATH, "//*[@class=\"proletariat\"]").text

Output:

	gophers://bitreich.org

## Selenium IDE

https://www.selenium.dev/selenium-ide/

Browser extension for Firefox and Chromium to record interactions with
websites.

* Easily generate scripts from that.

## Select Content in Websites

driver.find_element or driver.find_elements
	<p>
		<elem id="text" class="info" meta="...">
			text to grep
		</elem>
	</p>

e = driver.find_element(By.ID, "text").text
e = driver.find_element(By.TAG_NAME, "elem").text
e = driver.find_element(By.CLASS_NAME, "info").text
e = driver.find_element(By.XPATH, "//p/elem").text

Others: By.NAME (forms), By.CSS_SELECTOR, By.LINK_TEXT,
By.PARTIAL_LINK_TEXT

e.get_attribute("meta")

## Stuff we won't handle here.

Selenium can do:

* input
* fill out forms
* emulate key presses
* upload files
* send forms
* scroll web pages
* do pen actions (tablet)
* emulate the mouse
* drag and drop elements
* navigate around in the browser history

## Stuff we won't handle here.

Selenium can do:

* window manipulation
* handle multiple tabs / windows
* handle iframes
* move windows around
* take screenshots
* print websites
* handle popup alerts
* set / get cookies
* let you run inline javascript
* do color animations
* debug javascript for you using the bidirectional protocol
* build huge action chains for time-perfect handling

## Complex Example

https://www.kvsachsen.de/

* The 'new modern' website of my doctors' association.
* All in a javascript framework.
* News is hidden behind loading even more javascript.
* No RSS feed.

News:

1. Open https://www.kvsachsen.de/fuer-praxen/aktuelle-informationen/praxis-news
2. Parse the javascript in a subframe.
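The selection strategies shown earlier can be tried offline with just the
Python stdlib, without a running browser. This is a rough sketch only: the
markup, tag names, and attribute values are made up for illustration, and
`xml.etree.ElementTree` supports only a small XPath subset, not the full
XPath selenium offers:

```python
# Offline sketch of element selection using only the Python stdlib.
# The sample markup mirrors the hypothetical <elem> snippet from the
# selection slide; it is not a real website.
import xml.etree.ElementTree as ET

page = """<p>
	<elem id="text" class="info" meta="some value">text to grep</elem>
</p>"""

root = ET.fromstring(page)

# Rough equivalents of the By.* selection strategies:
by_tag = root.find(".//elem").text                    # like By.TAG_NAME
by_class = root.find(".//elem[@class='info']").text   # like By.CLASS_NAME
by_id = root.find(".//elem[@id='text']").text         # like By.ID
by_xpath = root.find("./elem").text                   # like By.XPATH "//p/elem"

print(by_tag)                               # -> text to grep
print(root.find(".//elem").get("meta"))     # -> some value
```

Handy for checking a selector idea against a saved HTML fragment before
pointing selenium at the live page.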
## Complex Example

Get stuff ready:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as chromeoptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from datetime import datetime
import pytz

link = "https://www.kvsachsen.de/fuer-praxen/\
aktuelle-informationen/praxis-news"

## Complex Example

Get ChromeDriver ready:

options = chromeoptions()
chromearguments = [
	"headless",
	"no-sandbox",
	"disable-extensions",
	"disable-dev-shm-usage",
	"start-maximized",
	"window-size=1900,1080",
	"disable-gpu"
]
for carg in chromearguments:
	options.add_argument(carg)
driver = webdriver.Chrome(options=options)

## Complex Example

Get the content:

driver.get(link)

## Complex Example

Wait for the content to be ready and loaded, with a timeout of 60
seconds:

isnews = WebDriverWait(driver=driver, timeout=60).until(
	EC.presence_of_element_located((By.XPATH,
		"//div[@data-last-letter]")
	)
)

EC ... Expected Condition

EC can be very many things:

https://www.selenium.dev/selenium/docs/api/py/\
webdriver_support/\
selenium.webdriver.support.expected_conditions.html

Pro Tip: Do not wait for a static time; use some EC for this. You will
be safer and have fewer errors.
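Under the hood, WebDriverWait is just polling. A conceptual, stdlib-only
sketch of the idea (the function name wait_until is my own, not a selenium
API):

```python
import time

def wait_until(condition, timeout=60.0, poll=0.5):
	"""Conceptual sketch of WebDriverWait(...).until(...):
	call condition() repeatedly until it returns something truthy,
	or raise once the timeout is exceeded. Illustration only, not
	the selenium implementation."""
	end = time.monotonic() + timeout
	while True:
		value = condition()
		if value:
			# WebDriverWait also hands back the condition's
			# value, e.g. the located element.
			return value
		if time.monotonic() >= end:
			raise TimeoutError(
				"condition not met within %gs" % timeout)
		time.sleep(poll)

# Usage sketch: instead of time.sleep(5) and hoping, poll explicitly.
print(wait_until(lambda: "element", timeout=1.0))  # -> element
```

This is why an EC beats a static sleep: it returns as soon as the page is
ready and only errors out when the page really never gets there.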
## Complex Example

Get the root news element we work from:

newslist = driver.find_elements(By.XPATH,
	"//div[@data-filter-target=\"list\"]")[0]

Get some metadata for the atom feed:

title = driver.find_elements(By.XPATH,
	"//meta[@property=\"og:title\"]")[0].\
	get_attribute("content")
description = title

## Complex Example

Print the header of the atom feed to stdout:

print("""<?xml version="1.0" encoding="UTF-8"?>""")
print("""<feed xmlns="http://www.w3.org/2005/Atom">""")
print("\t<title><![CDATA[%s]]></title>" % (title))
print("\t<subtitle><![CDATA[%s]]></subtitle>" % (description))
print("\t<id>%s</id>" % (link))
print("\t<link href=\"%s\"/>" % (link))
print("\t<link rel=\"self\" href=\"%s\"/>" % (link))

Use the current date for updated:

utcnow = datetime.now(pytz.utc)
print("\t<updated>%s</updated>" % (utcnow.isoformat()))

## Complex Example

Get the entries:

articles = newslist.find_elements(By.XPATH, "./div")

Prepare a base URI for appending relative links:

baselink = "/".join(link.split("/", 3)[:-1])

Loop over all entries in reverse order:

for article in articles[::-1]:

## Complex Example

Find the deep link to the article:

	link = article.find_elements(By.XPATH, "./a")[0]
	plink = link.get_attribute("href")

Normalize the link in case it is relative:

	if not plink.startswith("http"):
		plink = "%s/%s" % (baselink, plink)

Get the entry title and content, and set a static author:

	ptitle = link.get_attribute("data-title")
	pcontent = article.text
	pauthor = "sachsen@kvsachsen.de"

## Complex Example

Parse the datetime of the article release:

	updateds = article.find_elements(By.XPATH, ".//time")[0].text
	try:
		dtupdated = datetime.strptime(updateds, "%d.%m.%Y")
	except ValueError:
		continue

Bring the datetime into python-native format for further processing:

	dtupdated = dtupdated.replace(hour=12, minute=0,
		second=0, tzinfo=pytz.utc)
	if dtupdated.year > utcnow.year:
		dtupdated = dtupdated.replace(year=utcnow.year)
	pupdated = dtupdated

## Complex Example

Print the entry:

	print("\t<entry>")
	print("\t\t<id>%s</id>" % (plink))
	print("\t\t<title><![CDATA[%s]]></title>" % (ptitle))
	print("\t\t<link href=\"%s\"/>" % (plink))
	print("\t\t<author><email>%s</email></author>" % (pauthor))
	print("\t\t<updated>%s</updated>" % (pupdated.isoformat()))
	print("\t\t<content><![CDATA[%s]]></content>" % (pcontent))
	print("\t</entry>")

Print
the footer (outside the entries loop):

print("</feed>")

## Example Script

The full example script and how I use it can be found in:

	git://bitreich.org/brcon2023-hackathons
	./sfeed-atom/kvssachsen2atom

## Summary

* With Selenium you can script all of the modern web in all ways.
* We fight bloat with bloat.
* You run a full web browser process just to parse the web.
* We wanted to avoid that with scraping.
* You can easily prototype web access in, for example, ipython(1).
* There are still privacy concerns.
* You run a huge blob of hundreds of thousands of lines of code.
* Plato's cave allegory.

## Plato's cave allegory

[ASCII art: people in a cave watching shadows on the wall; outside,
the sun: "too bright!"]

* People are in a cave, watching the shadow figure of a hash, presented
  to them by a narrator behind a wall, near the exit of the cave.
* When people want to leave the cave, they are blinded by the sun. The
  sunlight hurts their eyes. They go back into the cave.
* The outside does not look as finely presented and prepared as the
  shadows of the narrator's play. The shadows do not hurt the eyes.
* Only some people are able to adapt their eyes and see the beauty of
  not being dependent on a narrator. They will be able to leave the
  cave.

## Questions?

Do you have questions?

## Thanks

Thank you for listening.

For further suggestions, contact me at:

	Christoph Lohmann <20h@r-36.net>