From: dbucklin@sdf.org
Date: 2018-06-29
Subject: Visualizing the History of Programming Languages

Recently, I came across the Wikipedia article, Timeline of
Programming Languages[1]. It has nicely-formatted tables for each
decade since the 1940s. Each table has the same format: one language
per row and, for each language, the year it came into being, its
name, creator, and a list of the languages that influenced it. When
data establishes a relationship between elements, as this page does
with the list of influences for each language, we have the makings
of a graph.

I've written before[2] about using PlantUML to create graphics from
textual information. In this case I'm using a similar tool,
PlantText[3], to create this graph. PlantText is a web-based editor
for PlantUML, and PlantUML hands graph layout off to GraphViz.
GraphViz uses a language, called DOT, to define the content and
structure of the information to be visualized. PlantText provides a
number of examples that illustrate how different kinds of
visualizations can be created. I'm basing this graph on the World
Dynamics template. I also got some additional ideas from "Drawing
graphs with dot"[4] by Gansner, Koutsofios, and North.

The first thing I need to do is put the tables into a structured
format that I can parse with awk. I'm using git-bash for this, and
it lacks my go-to tools for this kind of job, lynx and w3m. I do
have Pandoc installed, and I can use that to transform HTML to plain
text. I recently created a function called myw3m in my
.bash_profile.

    myw3m () {
        curl -s "$1" |
        perl -pe 'use open qw(:std :utf8); s/[^[:ascii:]]//g;' |
        pandoc -fhtml -tplain -
    }

This function pulls down the page using curl, passes it to perl to
strip out non-ASCII characters, and then to Pandoc to transform it
to plain text.
I can pull the Wikipedia page to a local file, converting it to
plain text along the way:

    myw3m https://en.wikipedia.org/wiki/Timeline_of_programming_languages > timeline.txt

This works fine in git-bash, but on my Debian VPS, I have an older
version of Pandoc that wraps over-zealously. I am able to get
similar results with w3m, using a large width value to prevent
wrapping. This also requires a little manual futzing with the data
to widen the inter-column space.

    w3m -cols 400 -dump https://en.wikipedia.org/wiki/Timeline_of_programming_languages > timeline.txt

Now I have the whole page as plain text, including the table rows,
which look like this:

      1970   Pascal   Niklaus Wirth, Kathleen Jensen   ALGOL 60, ALGOL W

I want to pass this into awk, so I need to put this data into
tab-delimited fields. First, I need to strip out any lines that are
not data I need to process. This is pretty easy because each line
from the table starts with two spaces and a four-digit number. I can
filter out everything else using sed:

    sed -ne '/^  [[:digit:]]\{4\}/p'

It looks like Pandoc has separated each column with three or more
spaces. We are lucky that no field data contains three consecutive
spaces. This means that we can use sed to replace any run of three
or more spaces with a tab. We should also strip off leading spaces
while we're at it. We are trying to make this easy to work with in
awk.

    sed -e 's/^ *//;s/ \{3,\}/\t/g'

Now a row in the table looks like this:

    1970^IPascal^INiklaus Wirth, Kathleen Jensen^IALGOL 60, ALGOL W

Pretty slick, no?

There are three sections of the graph that I need to build out. My
first subgraph is the collection of years in the data with the
relationship between each year made explicit. This effectively
creates an X-axis that will organize the rest of the graph.
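
As a rough sketch of that step (an illustration only, not the awk
script linked at the end of this post), the chain of years can be
built from the first tab-delimited field. The function name
year_chain is hypothetical, and the sample rows use "-" placeholders
for the creator and influence columns, which this step never reads.

```shell
# Print a DOT subgraph that links each distinct year to the next one.
# Reads tab-delimited rows on stdin; only field 1 (the year) is used.
year_chain () {
    cut -f1 | sort -nu | awk '
        BEGIN  { print "{" }
        NR > 1 { printf "\"%s\" -> \"%s\";\n", prev, $1 }
               { prev = $1 }
        END    { print "}" }
    '
}

# Sample input: years and names taken from this post, in no particular
# order, with "-" standing in for the columns that are ignored here.
printf '1989\tbash\t-\t-\n1988\ttcl\t-\t-\n1990\thaskell\t-\t-\n1989\tpython\t-\t-\n' |
    year_chain
```

Sorting with -n and -u puts the years in numeric order and drops
repeats, so a year with many languages still yields a single node in
the chain.
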
By building a list of all the years, I can create a subgraph that
looks like this:

    {
    "1988" -> "1989";
    "1989" -> "1990";
    "1990" -> "1991";
    }

The next thing I need to do is tell GraphViz to associate each
language with its corresponding year in the subgraph. This will
encourage GraphViz to visually rank each language along with its
year. I do this with a series of subgraphs that look like this:

    {rank=same; "1988"; "rpg/400"; "tcl"; "stos basic"; "actor"; "object rexx"; "spark"; "a+"; "hamilton c shell"; }
    {rank=same; "1989"; "turbo pascal oop"; "modula-3"; "powerbasic"; "lpc"; "bash"; "magik"; "python"; }
    {rank=same; "1990"; "amos basic"; "object oberon"; "j"; "haskell"; "z shell"; }

You'll notice that everything is in lower case. I found the
capitalization to be inconsistent within the Wikipedia article, so I
have to fix that to avoid duplicates.

Finally, I just need to define all the other nodes and edges. The
edges are directional, so I define them going from the influencing
language to the influenced language. For a line like this:

    1987   Perl   Larry Wall   C, sed, awk, sh

I create DOT statements like this:

    "c" -> "perl";
    "sed" -> "perl";
    "awk" -> "perl";
    "sh" -> "perl";

The finished product[5]. I had to cut it off after 2005 because
GraphViz would crash on me. You can also see the awk script[6] used
to create the data that I fed to PlantText.

I had a lot of fun with this challenge. Projects like this are a
great way to learn. I learned more about DOT and awk while doing
this. It's also fulfilling to see the potential in something and
then make it happen.

Happy hacking.

References:

1. https://en.wikipedia.org/wiki/Timeline_of_programming_languages
2. http://davebucklin.com/work/2017/09/11/diagrams-from-text-with-plantuml.html
3. https://www.planttext.com
4. http://www.graphviz.org/pdf/dotguide.pdf
5. http://davebucklin.com/assets/img/lang6.png
6. http://davebucklin.com/assets/toplg.txt