Python isn't really the right tool for parsing org-mode files. Ideally I would be using elisp, along with all the functions and regular expressions that work behind the scenes to orchestrate org-mode's functionality. Unfortunately, my fledgling understanding of Org-Mode's mapping API seems likely to fail me, and rather than end up yak shaving I'm going to use Python. I guess the 'right tool for the job' is often the tool you know best, which is probably why that phrase is usually heard right before someone makes a discursive ideological argument about their new favorite tool-chain.
Of course there is prior art for parsing Org-Mode files in Python. I could probably use PyOrgMode to do this, but when I cloned the repo and tried to parse some of my files it choked. "It's probably a quick fix to figure out why it didn't work…" is the thing I usually say right before a 5-hour marathon into someone else's code base. I made a note to look into it and wrote this (quick and ugly) function instead:
from collections import deque

def parse_headings(headings):
    """Clean up, some headings have ":" and * and other junk in them."""
    for heading in headings:
        # If we got an empty heading for some reason get rid of it
        if heading != '':
            hqueue = deque(heading.split(":"))
            # egrep will return a filename and the line separated by a :
            # luckily this lines up with org's tag system
            fn = hqueue.popleft()
            head = []
            tags = []
            # If there are tags last character will be a ':'
            # this check keeps us from parsing headlines with urls at
            # the end of them
            if heading[-1] == ":":
                while len(hqueue):
                    t = hqueue.pop()
                    # append to tags unless we've got an empty string
                    # or a queue item with spaces in it (org-mode tags
                    # can't have spaces)
                    if " " not in t and t != '':
                        tags.append(t)
                    elif t != '':
                        hqueue.append(t)
                        break
            # Re-join our head
            head = ":".join(hqueue)
            # Get number of stars
            stars, head = len(head.split(" ")[0]), " ".join(head.split(" ")[1:])
            yield (fn, stars, head, ",".join(tags))
With this function in place, collecting the headings is easy: subprocess's check_output() function runs egrep to pull lines that start with one or more '*' followed by a space:
import subprocess
# text=True (Python 3.7+) decodes egrep's output to str so it can be split on "\n"
headings = subprocess.check_output('egrep "^[*]+ " *.org', shell=True, text=True).split("\n")
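If shelling out to egrep isn't an option, the same "filename:line" strings could probably be collected in plain Python. This is only a sketch: grep_headings is a name I made up for illustration, and it assumes the .org files are UTF-8 and sit in a single directory.

```python
import glob
import os
import re

# Same pattern as the egrep call: one or more '*' followed by a space
HEADING_RE = re.compile(r"^[*]+ ")

def grep_headings(directory="."):
    """Return 'filename:line' strings for every heading in the *.org files.

    Hypothetical helper that mimics the output format egrep produces
    when it searches several files.
    """
    results = []
    for path in sorted(glob.glob(os.path.join(directory, "*.org"))):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if HEADING_RE.match(line):
                    # Prefix with the file name, as egrep does
                    results.append(f"{os.path.basename(path)}:{line}")
    return results
```

The returned list has the same shape as the egrep output, so it should be usable as the headings argument to parse_headings().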
The parsed headings can then be loaded into pandas:
import pandas as pd
df = pd.DataFrame(parse_headings(headings),
                  columns=["FileName", "Level", "Heading", "Tags"])
ti = df['Tags'].str.get_dummies(sep=",").astype(bool)
This gives me a data frame with four columns: FileName, Level, Heading and Tags. I use the pandas string method str.get_dummies() to create a second data frame of tag indicator variables so I can look at tag-specific subgroups and do basic tag analysis.
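As a rough illustration of that kind of tag analysis, the sketch below builds the same indicator frame from a few made-up headings (the file names, headings and tags are invented sample data, not taken from my files), then counts headings per tag and pulls out a tag subgroup.

```python
import pandas as pd

# Invented sample rows in the (FileName, Level, Heading, Tags) shape used above
rows = [
    ("work.org", 1, "Write report", "work,urgent"),
    ("work.org", 2, "Email Bob", "work"),
    ("home.org", 1, "Buy milk", "errand"),
    ("home.org", 1, "Untagged idea", ""),
]
df = pd.DataFrame(rows, columns=["FileName", "Level", "Heading", "Tags"])

# One boolean column per tag; a heading with no tags is False everywhere
ti = df["Tags"].str.get_dummies(sep=",").astype(bool)

# Number of headings carrying each tag
tag_counts = ti.sum()

# Subgroup: all headings tagged "work"
work_headings = df[ti["work"]]

# Headings tagged both "work" and "urgent"
urgent_work = df[ti["work"] & ti["urgent"]]
```

Because ti shares its index with df, any boolean column (or combination of columns) from ti can be used directly as a row filter on df.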