From: dbucklin@sdf.org
Date: 2018-04-17
Subject: Evernote Extraction

I take notes all the time. I love having access to my notes wherever I go. Evernote does that. However, I've become increasingly dissatisfied with the complexity of their client software. Also, they recently stopped supporting Geeknote, a CLI client. [1] Geeknote has its own problems, so maybe it's time to make a change.

After evaluating a number of solutions, I settled on vimwiki. [2] Vimwiki will let me manage my information in plain text, and I can even publish an HTML version of it. My entire collection of notes should be small enough that I can pull everything down to my phone. Now I just have to extract my data from Evernote. Easy, right?

Evernote doesn't make a desktop client for Linux, so I fired up my Mac Mini, since I need to use the desktop client to export my data. I exported each of my notebooks into a separate enex file (Evernote's XML format). Looking at it, I wonder if it's even valid XML. How am I going to get my data out of here?

My first move is to install html-xml-utils. After experimenting with `hxpipe` and `hxextract`, it seems like html-xml-utils is more about manipulating HTML/XML while retaining the format, not filtering the data away from the format.

I had a quick chat with tomasino [3] and he referred me to ever2simple. [4] Ever2simple is a tool that aims to help people migrate from Evernote to Simplenote. After some trial and error, I was able to install ever2simple, but first I had to install python-pip, python-libxml2, python-lxml, and python-libxslt1.

I'm starting with one of my smallest notebooks, a journal, just so I can prove the concept. I want to migrate these journal entries to the journal.txt file that I maintain with jrnl. [5] I tried the `-f dir` option first, hoping this would just give me a folder full of text files. That's exactly what it does, but there's no metadata. I need the timestamps.

Using ever2simple with the `-f json` option gives me my metadata, but now everything is in a huge JSON stream. After some experimentation with sed, I conclude that sed is not the right tool for this job.

I remember hearing about something called `jq` that should let me work with JSON. The apt package description for `jq` starts with, "jq is like sed for JSON...". Well, I'm sold. Also, no dependencies! What a bonus. The man page is full of explanations and examples, but I'm going to need to experiment with the filters. After some experimentation, I land on

  jq '.[] | .createdate,.content' journal.json

This cycles through each top-level element and extracts the createdate and content values. Now I wonder how I can add a separator so that I can dissect the data into discrete files with awk or something. I should be able to add a literal to the list of filters.

  jq '.[] | .createdate,.content,"%%"' journal.json

Well, the %% lines include the quotes, but that's not the end of the world.

I wonder what date format I need for jrnl. Each jrnl entry starts with

  YYYY-MM-DD HH:MM Title

Evernote gives me dates that look like

  Jul 25 2011 HH:MM:SS

`date --help` to the rescue! Looking at date handling in `jq`, I should be able to convert the dates from the format used by Evernote to the format used by jrnl with the filter

  strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")
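As a quick sanity check on a made-up timestamp (this assumes a jq new enough to have strptime, i.e. 1.5 or later), jq will apply the filter to a bare JSON string:

  echo '"Jul 25 2011 09:15:00"' |
    jq 'strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")'
  "2011-07-25 09:15"

strptime parses the string into jq's broken-down time representation, and strftime formats that back out in the shape jrnl wants.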
All together, then.

  jq '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json

I still have some garbage in there, but I'm getting close to being able to just prepend this to my journal.txt file. OK, I'm close enough with this:

  jq '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json |
    sed -e 's/^"//;s/"$//;s/\\n/\n/g' |
    sed -e '/^ *$/d' > journal.part

Okay, let's try the recipes notebook. My recipes notebook should be a little more challenging than my journal entries, but it's not as massive as my main notebook.

  ever2simple -f json -o recipes.json recipes.enex

My journal JSON file was 5k. This one is 105k. Running the same command as before gives me pretty legible output. I know some of these notes had attachments, but I don't see them in the JSON. I wonder if they are MIME-encoded in the XML file.

Looking back at my recipes.enex file, attachments do appear to be base64-encoded in the XML, but ever2simple doesn't copy this data into the JSON file it creates. That makes sense, since its target is Simplenote. Maybe html-xml-utils can help me get these files out.

  hxextract 'resource' recipes.enex

It looks like the files are encapsulated within resource elements. The resource element contains metadata about the attachment, and the base64-encoded data itself is inside a data element. I can isolate the data using hxselect.

  hxselect -c -s '\n\n' data < recipes.enex > recipes.dat

This gives me all the MIME attachments in a single file, with each base64-encoded file separated by two newlines. This doesn't preserve my metadata, but I'm anxious to get the data out and see what's in there. Let's see if I can pipe the first one into base64 -d to decode it. An awk one-liner should let me terminate output at the first blank line.

  awk '/^$/ {exit} {print $0}' recipes.dat | base64 -d > testfile

Now I can use `file` to find out what kind of file it is.

  file testfile

This tells me that it's an image. A JPEG, to be specific, and it's 300 dpi and 147x127. That seems small. I wonder if Evernote encoded all of the images that were in the HTML pages I saved. Opening the file in an image viewer, I can see that that's exactly what it is.

How many attachments are in there? Could I...

  sed -e '/^./d' recipes.dat | wc

Damn, that's slick. Deleting every non-blank line and counting what's left tells me there are 74 files in there. I'll bet only a handful of them have any value to me. I think the easiest way forward is to copy each base64 attachment into its own file. Looking at split(1), it splits on line count, not on a delimiter. What if I do something like...

  #!/usr/bin/awk -f
  # split recipes.dat on blank lines, one attachment per file
  BEGIN {fcount=1}
  /^$/  {fcount++}
        {print $0 >> "dump/" fcount ".base64"}

This goes through my recipes.dat file and puts each base64-encoded attachment into its own file under dump/ (which has to exist first). Now I need to decode them and give them an appropriate suffix.

  #!/bin/bash
  # decode each dump and rename it based on what file(1) says it is
  for f in dump/*
  do
      outfile="${f%.*}.out"
      base64 -d "${f}" > "${outfile}"
      type=$(file "${outfile}")    # e.g. "dump/1.out: JPEG image data, ..."
      type="${type#* }"            # strip the "name: " prefix
      type="${type%% *}"           # keep the first word, e.g. JPEG
      newout="${outfile%.out}.${type}"
      mv "$outfile" "$newout"
  done

Phew! Now I have 74 files to look through. Most of these are garbage from web pages I saved. There are really only five of them that I want to keep. There are a few problems with this approach (I sketch a partial fix below):

* I lose the original file name.
* I use the file utility to reconstruct the filename extension.
* I lose the association between the file and the note.

This has been a lot of work, and there's a lot more to be done. Looking at my main notebook, I may revisit ever2simple's `-f dir` option. I could even look at the source and see if there's a way to tack on metadata.
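In the meantime, the same hxselect trick that pulled out the data elements ought to recover some of that metadata. If I'm reading recipes.enex correctly, each resource also carries a mime element and, usually, a file-name element inside resource-attributes, so a sketch like this (untested beyond eyeballing the XML) could list them in document order:

  # one MIME type per line, in document order
  hxselect -c -s '\n' 'resource mime' < recipes.enex > recipes.mime
  # original file names, where the note kept one
  hxselect -c -s '\n' 'resource file-name' < recipes.enex > recipes.names
  paste recipes.names recipes.mime

The catch is that not every resource has a file-name, so the two lists won't necessarily line up; pairing them reliably would mean walking the resource elements one at a time.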
I assume there are better ways to go about this, but I love challenges like this because they're an excuse to learn new tools and get better with the tools I'm already familiar with. Next time, I'll show you how I migrate this information to vimwiki.

## References

1. http://www.geeknote.me/
2. https://vimwiki.github.io/
3. gopher://gopher.black
4. https://github.com/claytron/ever2simple
5. http://jrnl.sh