Python isn't really the right tool for parsing Org-Mode files. Ideally I should be using elisp, along with all the functions and regular expressions that work behind the scenes to orchestrate Org-Mode's functionality. Unfortunately my fledgling understanding of Org-Mode's mapping API seems likely to fail me, and rather than end up yak shaving I'm going to use Python. I guess the 'right tool for the job' is often the tool you know best, which is probably why that phrase is usually heard right before someone makes a discursive ideological argument about their new favorite tool-chain.

Of course there is prior art for parsing Org-Mode files in Python. I could probably use PyOrgMode to do this, but when I cloned the repo and tried to parse some of my files, it choked. "It's probably a quick fix to figure out why it didn't work…" is the thing I usually say right before a 5-hour marathon through someone else's code base. I made a note to look into it and wrote this (quick and ugly) function instead:

from collections import deque

def parse_headings(headings):
    """Clean up,  some headings have ":" and * and other junk in them."""
    for heading in headings:
        # If we got an empty heading for some reason get rid of it
        if heading != '':
            hqueue = deque(heading.split(":"))
            # egrep will return a filename and the line separated by a ':'
            # luckily this lines up with org's tag system
            fn = hqueue.popleft()
            head = []
            tags = []
            # If there are tags, the last character will be a ':';
            # this check keeps us from parsing headlines with URLs
            # at the end of them
            if heading[-1] == ":":
                while len(hqueue):
                    t = hqueue.pop()
                    # append to tags unless we've got an empty string
                    # or a queue item with spaces in it (org-mode tags
                    # can't have spaces)
                    if " " not in t and t != '':
                        tags.append(t)
                    elif t != '':
                        hqueue.append(t)
                        break

            # Re-join our head
            head = ":".join(hqueue)
            # Get number of stars
            stars, head = len(head.split(" ")[0]), " ".join(head.split(" ")[1:])
            yield (fn, stars, head, ",".join(tags))
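
As a quick sanity check, here's roughly what the generator yields for a made-up heading line (the file name and tags are invented for illustration):

sample = ['notes.org:** TODO Tidy the parser :python:org:']
print(list(parse_headings(sample)))
# [('notes.org', 2, 'TODO Tidy the parser ', 'org,python')]

The tags come back in reverse order (they're popped off the right end of the deque) and the heading keeps a trailing space, but it's quick and ugly, as promised.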

This makes it easy to parse headings gathered with subprocess's check_output() function and egrep, pulling every line that starts with one or more '*' followed by a space:

import subprocess
headings = subprocess.check_output('egrep "^[*]+ " *.org', shell=True).decode().split("\n")

The headings can then be parsed and loaded into a pandas data frame:

import pandas as pd

df = pd.DataFrame(parse_headings(headings),
                  columns=["FileName", "Level", "Heading", "Tags"])
ti = df['Tags'].str.get_dummies(sep=",").astype(bool)

This gives me a data frame with four columns: FileName, Level, Heading, and Tags. I use the string method get_dummies() to create a data frame of tag indicator variables so we can look at tag-specific subgroups and do basic tag analysis.
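
From there it's easy to poke at the tags; for instance, counting how many headings carry each tag, or pulling out the subset of headings with a particular tag ('python' below is just a stand-in for whatever tags your files actually use):

# How many headings carry each tag, most common first
tag_counts = ti.sum().sort_values(ascending=False)

# All headings carrying a given tag (here the hypothetical 'python' tag)
python_headings = df[ti['python']]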