Computational Linguistics

Demystifying NLP

Alpha release of spaCy: A library for industrial-strength NLP with Python/Cython

I’ve been working on this full-time for about six months, so it’s exciting to finally release it.

Discuss/praise/mock/upvote/etc on Hacker News or /r/programming

2015/01/25 Posted by | Uncategorized | Leave a comment

Writing C in Cython

For the last two years, I’ve done almost all of my work in Cython. And I don’t mean, I write Python, and then “Cythonize” it, with various type-declarations etc. I just, write Cython. I use “raw” C structs and arrays, and occasionally C++ vectors, with a thin wrapper around malloc/free that I wrote myself. The code is almost always exactly as fast as C/C++, because it really is just C/C++ with some syntactic sugar — but with Python “right there”, should I need/want it.

This is basically the inverse of the old promise that languages like Python came with: that you would write your whole application in Python, optimise the “hot spots” with C, and voila! C speed, Python convenience, and money in the bank.

This was always much nicer in theory than practice. In practice, your data structures have a huge influence on both the efficiency of your code, and how annoying it is to write. Arrays are a pain and fast; lists are blissfully convenient, and very slow. Python loops and function calls are also quite slow, so the part you have to write in C tends to wriggle its way up the stack, until it’s almost your whole application.

Today a post came up on HN, on writing C extensions for Python. The author wrote both a pure Python implementation, and a C implementation, using the Numpy C API. This seemed a good opportunity to demonstrate the difference, so I wrote a Cython implementation for comparison:

import random
from cymem.cymem cimport Pool

from libc.math cimport sqrt

cimport cython

cdef struct Point:
    double x
    double y

cdef class World:
    cdef Pool mem
    cdef int N
    cdef double* m
    cdef Point* r
    cdef Point* v
    cdef Point* F
    cdef readonly double dt
    def __init__(self, N, threads=1, m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.mem = Pool()
        self.N = N
        self.m = <double*>self.mem.alloc(N, sizeof(double))
        self.r = <Point*>self.mem.alloc(N, sizeof(Point))
        self.v = <Point*>self.mem.alloc(N, sizeof(Point))
        self.F = <Point*>self.mem.alloc(N, sizeof(Point))
        for i in range(N):
            self.m[i] = random.uniform(m_min, m_max)
            self.r[i].x = random.uniform(-r_max, r_max)
            self.r[i].y = random.uniform(-r_max, r_max)
            self.v[i].x = random.uniform(-v_max, v_max)
            self.v[i].y = random.uniform(-v_max, v_max)
            self.F[i].x = 0
            self.F[i].y = 0
        self.dt = dt

def compute_F(World w):
    """Compute the force on each body in the world, w."""
    cdef int i, j
    cdef double s3, tmp
    cdef Point s
    cdef Point F
    for i in range(w.N):
        # Set all forces to zero. 
        w.F[i].x = 0
        w.F[i].y = 0
        for j in range(i+1, w.N):
            s.x = w.r[j].x - w.r[i].x
            s.y = w.r[j].y - w.r[i].y

            s3 = sqrt(s.x * s.x + s.y * s.y)
            s3 *= s3 * s3;

            tmp = w.m[i] * w.m[j] / s3
            F.x = tmp * s.x
            F.y = tmp * s.y

            w.F[i].x += F.x
            w.F[i].y += F.y

            w.F[j].x -= F.x
            w.F[j].y -= F.y

def evolve(World w, int steps):
    """Evolve the world, w, through the given number of steps."""
    cdef int _, i
    for _ in range(steps):
        for i in range(w.N):
            w.v[i].x += w.F[i].x * w.dt / w.m[i]
            w.v[i].y += w.F[i].y * w.dt / w.m[i]
            w.r[i].x += w.v[i].x * w.dt
            w.r[i].y += w.v[i].y * w.dt

The Cython version took about 30 minutes to write, and it runs just as fast as the C code — because, why wouldn’t it? It *is* C code, really, with just some syntactic sugar. And you don’t even have to learn or think about a foreign, complicated C API…You just, write C. Or C++ — although that’s a little more awkward. Both the Cython version and the C version are about 70x faster than the pure Python version, which uses Numpy arrays.

One difference from C: I wrote a little wrapper around malloc/free, cymem. All it does is remember the addresses it served, and when the Pool is garbage collected, it frees the memory it allocated. I’ve had no trouble with memory leaks since I started using this.

The “intermediate” way of writing Cython, using typed memory-views, allows you to use the Numpy multi-dimensional array features. However, to me it feels more complicated, and the applications I tend to write involve very sparse arrays — where, once again, I want to define my own data structures.

I found a Russian translation of this post here: . I don’t know how accurate it is.

2014/10/21 Posted by | Uncategorized | 8 Comments

Parsing English with 500 lines of Python

A syntactic parser describes a sentence’s grammatical structure, to help another application reason about it. Natural languages introduce many unexpected ambiguities, which our world-knowledge immediately filters out. A favourite example:

They ate the pizza with anchovies

A correct parse links “with” to “pizza”, while an incorrect parse links “with” to “eat”:

The Natural Language Processing (NLP) community has made big progress in syntactic parsing over the last few years. It’s now possible for a tiny Python implementation to perform better than the widely-used Stanford PCFG parser:

Update! The Stanford CoreNLP library now includes a greedy transition-based dependency parser, similar to the one described in this post, but with an improved learning strategy. It is much faster and more accurate than this simple Python implementation.

Parser Accuracy Speed (w/s) Language LOC
Stanford 89.6% 19 Java > 50,000[1] 89.8% 2,020 Python ~500
Redshift 93.6% 2,580 Cython ~4,000

The rest of the post sets up the problem, and then takes you through a concise implementation, prepared for this post. The first 200 lines of, the part-of-speech tagger and learner, are described here. You should probably at least skim that post before reading this one, unless you’re very familiar with NLP research.

The Cython system, Redshift, was written for my current research. I plan to improve it for general use in June, after my contract ends at Macquarie University. The current version is hosted on GitHub.

Problem Description

It’d be nice to type an instruction like this into your phone:

Set volume to zero when I’m in a meeting, unless John’s school calls.

And have it set the appropriate policy. On Android you can do this sort of thing with Tasker, but an NL interface would be much better. It’d be especially nice to receive a meaning representation you could edit, so you could see what it thinks you said, and correct it.

There are lots of problems to solve to make that work, but some sort of syntactic representation is definitely necessary. We need to know that:

Unless John’s school calls, when I’m in a meeting, set volume to zero

is another way of phrasing the first instruction, while:

Unless John’s school, call when I’m in a meeting

means something completely different.

A dependency parser returns a graph of word-word relationships, intended to make such reasoning easier. Our graphs will be trees — edges will be directed, and every node (word) will have exactly one incoming arc (one dependency, with its head), except one.

Example usage:

    >>> parser = parser.Parser()
    >>> tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
    >>> tags, heads = parser.parse(tokens)
    >>> heads
    [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
    >>> for i, h in enumerate(heads): 
    ...   head = tokens[heads[h]] if h &amp;amp;gt;= 1 else 'None'
    ...   print(tokens[i] + ' <-- ' + head])
    Set <-- None
    the <-- volume
    volume <-- Set
    to <-- Set
    zero <-- to
    when <-- Set
    I <-- 'm
    'm <-- when
    in <-- 'm
    a <-- meeting
    meeting <-- in
    unless <-- Set
    John <-- 's
    's   <-- calls
    school <-- calls
    calls <-- unless

The idea is that it should be slightly easier to reason from the parse, than it was from the string. The parse-to-meaning mapping is hopefully simpler than the string-to-meaning mapping.

The most confusing thing about this problem area is that “correctness” is defined by convention — by annotation guidelines. If you haven’t read the guidelines and you’re not a linguist, you can’t tell whether the parse is “wrong” or “right”, which makes the whole task feel weird and artificial.

For instance, there’s a mistake in the parse above: “John’s school calls” is structured wrongly, according to the Stanford annotation guidelines. The structure of that part of the sentence is how the annotators were instructed to parse an example like “John’s school clothes”.

It’s worth dwelling on this point a bit. We could, in theory, have written our guidelines so that the “correct” parses were reversed. There’s good reason to believe the parsing task will be harder if we reversed our convention, as it’d be less consistent with the rest of the grammar.[2] But we could test that empirically, and we’d be pleased to gain an advantage by reversing the policy.

We definitely do want that distinction in the guidelines — we don’t want both to receive the same structure, or our output will be less useful. The annotation guidelines strike a balance between what distinctions downstream applications will find useful, and what parsers will be able to predict easily.

Projective trees

There’s a particularly useful simplification that we can make, when deciding what we want the graph to look like: we can restrict the graph structures we’ll be dealing with. This doesn’t just give us a likely advantage in learnability; it can have deep algorithmic implications. We follow most work on English in constraining the dependency graphs to be projective trees:

  1. Tree. Every word has exactly one head, except for the dummy ROOT symbol.
  2. Projective. For every pair of dependencies (a1, a2) and (b1, b2), if a1 < b2, then a2 >= b2. In other words, dependencies cannot “cross”. You can’t have a pair of dependencies that goes a1 b1 a2 b2, or b1 a1 b2 a2.

There’s a rich literature on parsing non-projective trees, and a smaller literature on parsing DAGs. But the parsing algorithm I’ll be explaining deals with projective trees.

Greedy transition-based parsing

Our parser takes as input a list of string tokens, and outputs a list of head indices, representing edges in the graph. If the ith member of heads is j, the dependency parse contains an edge (j, i). A transition-based parser is a finite-state transducer; it maps an array of N words onto an output array of N head indices:

start MSNBC reported that Facebook bought WhatsApp for $16bn root
0 2 9 2 4 2 4 4 7 0

The heads array denotes that the head of MSNBC is reported: MSNBC is word 1, and reported is word 2, and heads[1] == 2. You can already see why parsing a tree is handy — this data structure wouldn’t work if we had to output a DAG, where words may have multiple heads.

Although heads can be represented as an array, we’d actually like to maintain some alternate ways to access the parse, to make it easy and efficient to extract features. Our Parse class looks like this:

    class Parse(object):
        def __init__(self, n):
            self.n = n
            self.heads = [None] * (n-1)
            self.lefts = []
            self.rights = []
            for i in range(n+1):

        def add_arc(self, head, child):
            self.heads[child] = head
            if child &amp;amp;lt; head:

As well as the parse, we also have to keep track of where we’re up to in the sentence. We’ll do this with an index into the words array, and a stack, to which we’ll push words, before popping them once their head is set. So our state data structure is fundamentally:

  • An index, i, into the list of tokens;
  • The dependencies added so far, in Parse
  • A stack, containing words that occurred before i, for which we’re yet to assign a head.

Each step of the parsing process applies one of three actions to the state:

    SHIFT = 0; RIGHT = 1; LEFT = 2

    def transition(move, i, stack, parse):
        global SHIFT, RIGHT, LEFT
        if move == SHIFT:
            return i + 1
        elif move == RIGHT:
            parse.add_arc(stack[-2], stack.pop())
            return i
        elif move == LEFT:
            parse.add_arc(i, stack.pop())
            return i
        raise GrammarError("Unknown move: %d" % move)

The LEFT and RIGHT actions add dependencies and pop the stack, while SHIFT pushes the stack and advances i into the buffer.

So, the parser starts with an empty stack, and a buffer index at 0, with no dependencies recorded. It chooses one of the (valid) actions, and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input. (It’s hard to understand this sort of algorithm without stepping through it. Try coming up with a sentence, drawing a projective parse tree over it, and then try to reach the parse tree by choosing the right sequence of transitions.)

Here’s what the parsing loop looks like in code:

    class Parser(object):
        def parse(self, words):
            tags = self.tagger(words)
            n = len(words)
            idx = 1
            stack = [0]
            deps = Parse(n)
            while stack or idx &amp;amp;lt; n:
                features = extract_features(words, tags, idx, n, stack, deps)
                scores = self.model.score(features)
                valid_moves = get_valid_moves(i, n, len(stack))
                next_move = max(valid_moves, key=lambda move: scores[move])
                idx = transition(next_move, idx, stack, parse)
            return tags, parse

    def get_valid_moves(i, n, stack_depth):
        moves = []
        if i &amp;amp;lt; n:
        if stack_depth <= 2:
        if stack_depth <= 1:
        return moves

We start by tagging the sentence, and initializing the state. We then map the state to a set of features, which we score using a linear model. We then find the best-scoring valid move, and apply it to the state.

The model scoring works the same as it did in the POS tagger. If you’re confused about the idea of extracting features and scoring them with a linear model, you should review that post. Here’s a reminder of how the model scoring works:

    class Perceptron(object)
        def score(self, features):
            all_weights = self.weights
            scores = dict((clas, 0) for clas in self.classes)
            for feat, value in features.items():
                if value == 0:
                if feat not in all_weights:
                weights = all_weights[feat]
                for clas, weight in weights.items():
                    scores[clas] += value * weight
            return scores

It’s just summing the class-weights for each feature. This is often expressed as a dot-product, but when you’re dealing with multiple classes, that gets awkward, I find.

The beam parser (RedShift) tracks multiple candidates, and only decides on the best one at the very end. We’re going to trade away accuracy in favour of efficiency and simplicity. We’ll only follow a single analysis. Our search strategy will be entirely greedy, as it was with the POS tagger. We’ll lock-in our choices at every step.

If you read the POS tagger post carefully, you might see the underlying similarity. What we’ve done is mapped the parsing problem onto a sequence-labelling problem, which we address using a “flat”, or unstructured, learning algorithm (by doing greedy search).


Feature extraction code is always pretty ugly. The features for the parser refer to a few tokens from the context::

  • The first three words of the buffer (n0, n1, n2)
  • The top three words of the stack (s0, s1, s2)
  • The two leftmost children of s0 (s0b1, s0b2);
  • The two rightmost children of s0 (s0f1, s0f2);
  • The two leftmost children of n0 (n0b1, n0b2)

For these 12 tokens, we refer to the word-form, the part-of-speech tag, and the number of left and right children attached to the token.

Because we’re using a linear model, we have our features refer to pairs and triples of these atomic properties.

def extract_features(words, tags, n0, n, stack, parse):
    def get_stack_context(depth, stack, data):
        if depth >= 3:
            return data[stack[-1]], data[stack[-2]], data[stack[-3]]
        elif depth >= 2:
            return data[stack[-1]], data[stack[-2]], ''
        elif depth == 1:
            return data[stack[-1]], '', ''
            return '', '', ''

    def get_buffer_context(i, n, data):
        if i + 1 >= n:
            return data[i], '', ''
        elif i + 2 >= n:
            return data[i], data[i + 1], ''
            return data[i], data[i + 1], data[i + 2]

    def get_parse_context(word, deps, data):
        if word == -1:
            return 0, '', ''
        deps = deps[word]
        valency = len(deps)
        if not valency:
            return 0, '', ''
        elif valency == 1:
            return 1, data[deps[-1]], ''
            return valency, data[deps[-1]], data[deps[-2]]

    features = {}
    # Set up the context pieces --- the word, W, and tag, T, of:
    # S0-2: Top three words on the stack
    # N0-2: First three words of the buffer
    # n0b1, n0b2: Two leftmost children of the first word of the buffer
    # s0b1, s0b2: Two leftmost children of the top word of the stack
    # s0f1, s0f2: Two rightmost children of the top word of the stack

    depth = len(stack)
    s0 = stack[-1] if depth else -1

    Ws0, Ws1, Ws2 = get_stack_context(depth, stack, words)
    Ts0, Ts1, Ts2 = get_stack_context(depth, stack, tags)

    Wn0, Wn1, Wn2 = get_buffer_context(n0, n, words)
    Tn0, Tn1, Tn2 = get_buffer_context(n0, n, tags)

    Vn0b, Wn0b1, Wn0b2 = get_parse_context(n0, parse.lefts, words)
    Vn0b, Tn0b1, Tn0b2 = get_parse_context(n0, parse.lefts, tags)

    Vn0f, Wn0f1, Wn0f2 = get_parse_context(n0, parse.rights, words)
    _, Tn0f1, Tn0f2 = get_parse_context(n0, parse.rights, tags)

    Vs0b, Ws0b1, Ws0b2 = get_parse_context(s0, parse.lefts, words)
    _, Ts0b1, Ts0b2 = get_parse_context(s0, parse.lefts, tags)

    Vs0f, Ws0f1, Ws0f2 = get_parse_context(s0, parse.rights, words)
    _, Ts0f1, Ts0f2 = get_parse_context(s0, parse.rights, tags)

    # Cap numeric features at 5? 
    # String-distance
    Ds0n0 = min((n0 - s0, 5)) if s0 != 0 else 0

    features['bias'] = 1
    # Add word and tag unigrams
    for w in (Wn0, Wn1, Wn2, Ws0, Ws1, Ws2, Wn0b1, Wn0b2, Ws0b1, Ws0b2, Ws0f1, Ws0f2):
        if w:
            features['w=%s' % w] = 1
    for t in (Tn0, Tn1, Tn2, Ts0, Ts1, Ts2, Tn0b1, Tn0b2, Ts0b1, Ts0b2, Ts0f1, Ts0f2):
        if t:
            features['t=%s' % t] = 1

    # Add word/tag pairs
    for i, (w, t) in enumerate(((Wn0, Tn0), (Wn1, Tn1), (Wn2, Tn2), (Ws0, Ts0))):
        if w or t:
            features['%d w=%s, t=%s' % (i, w, t)] = 1

    # Add some bigrams
    features['s0w=%s,  n0w=%s' % (Ws0, Wn0)] = 1
    features['wn0tn0-ws0 %s/%s %s' % (Wn0, Tn0, Ws0)] = 1
    features['wn0tn0-ts0 %s/%s %s' % (Wn0, Tn0, Ts0)] = 1
    features['ws0ts0-wn0 %s/%s %s' % (Ws0, Ts0, Wn0)] = 1
    features['ws0-ts0 tn0 %s/%s %s' % (Ws0, Ts0, Tn0)] = 1
    features['wt-wt %s/%s %s/%s' % (Ws0, Ts0, Wn0, Tn0)] = 1
    features['tt s0=%s n0=%s' % (Ts0, Tn0)] = 1
    features['tt n0=%s n1=%s' % (Tn0, Tn1)] = 1

    # Add some tag trigrams
    trigrams = ((Tn0, Tn1, Tn2), (Ts0, Tn0, Tn1), (Ts0, Ts1, Tn0), 
                (Ts0, Ts0f1, Tn0), (Ts0, Ts0f1, Tn0), (Ts0, Tn0, Tn0b1),
                (Ts0, Ts0b1, Ts0b2), (Ts0, Ts0f1, Ts0f2), (Tn0, Tn0b1, Tn0b2),
                (Ts0, Ts1, Ts1))
    for i, (t1, t2, t3) in enumerate(trigrams):
        if t1 or t2 or t3:
            features['ttt-%d %s %s %s' % (i, t1, t2, t3)] = 1

    # Add some valency and distance features
    vw = ((Ws0, Vs0f), (Ws0, Vs0b), (Wn0, Vn0b))
    vt = ((Ts0, Vs0f), (Ts0, Vs0b), (Tn0, Vn0b))
    d = ((Ws0, Ds0n0), (Wn0, Ds0n0), (Ts0, Ds0n0), (Tn0, Ds0n0),
        ('t' + Tn0+Ts0, Ds0n0), ('w' + Wn0+Ws0, Ds0n0))
    for i, (w_t, v_d) in enumerate(vw + vt + d):
        if w_t or v_d:
            features['val/d-%d %s %d' % (i, w_t, v_d)] = 1
    return features


Weights are learned using the same algorithm, averaged perceptron, that we used for part-of-speech tagging. Its key strength is that it’s an online learning algorithm: examples stream in one-by-one, we make our prediction, check the actual answer, and adjust our beliefs (weights) if we were wrong.

The training loop looks like this:

class Parser(object):
    def train_one(self, itn, words, gold_tags, gold_heads):
        n = len(words)
        i = 2; stack = [1]; parse = Parse(n)
        tags = self.tagger.tag(words)
        while stack or (i + 1) &amp;amp;lt; n:
            features = extract_features(words, tags, i, n, stack, parse)
            scores = self.model.score(features)
            valid_moves = get_valid_moves(i, n, len(stack))
            guess = max(valid_moves, key=lambda move: scores[move])
            gold_moves = get_gold_moves(i, n, stack, parse.heads, gold_heads)
            best = max(gold_moves, key=lambda move: scores[move])
        self.model.update(best, guess, features)
        i = transition(guess, i, stack, parse)
    # Return number correct
    return len([i for i in range(n-1) if parse.heads[i] == gold_heads[i]])

The most interesting part of the training process is in get_gold_moves. The performance of our parser is made possible by an advance by Goldberg and Nivre (2012), who showed that we’d been doing this wrong for years.

In the POS-tagging post, I cautioned that during training you need to make sure you pass in the last two predicted tags as features for the current tag, not the last two gold tags. At test time you’ll only have the predicted tags, so if you base your features on the gold sequence during training, your training contexts won’t resemble your test-time contexts, so you’ll learn the wrong weights.

In parsing, the problem was that we didn’t know how to pass in the predicted sequence! Training worked by taking the gold-standard tree, and finding a transition sequence that led to it. i.e., you got back a sequence of moves, with the guarantee that if you followed those moves, you’d get the gold-standard dependencies.

The problem is, we didn’t know how to define the “correct” move to teach a parser to make if it was in any state that wasn’t along that gold-standard sequence. Once the parser had made a mistake, we didn’t know how to train from that example.

That was a big problem, because it meant that once the parser started making mistakes, it would end up in states unlike any in its training data — leading to yet more mistakes.
The problem was specific to greedy parsers: once you use a beam, there’s a natural way to do structured prediction.

The solution seems obvious once you know it, like all the best breakthroughs. What we do is define a function that asks “How many gold-standard dependencies can be recovered from this state?”. If you can define that function, then you can apply each move in turn, and ask, “How many gold-standard dependencies can be recovered from this state?”. If the action you applied allows fewer gold-standard dependencies to be reached, then it is sub-optimal.

That’s a lot to take in.

So we have this function Oracle(state):

Oracle(state) = | gold_arcs ∩ reachable_arcs(state) |

We also have a set of actions, each of which returns a new state. We want to know:

  • shift_cost = Oracle(state) – Oracle(shift(state))
  • right_cost = Oracle(state) – Oracle(right(state))
  • left_cost = Oracle(state) – Oracle(left(state))

Now, at least one of those costs has to be zero. Oracle(state) is asking, “what’s the cost of the best path forward?”, and the first action of that best path has to be shift, right, or left.

It turns out that we can derive Oracle fairly simply for many transition systems. The derivation for the transition system we’re using, Arc Hybrid, is in Goldberg and Nivre (2013).

We’re going to implement the oracle as a function that returns the zero-cost moves, rather than implementing a function Oracle(state). This prevents us from doing a bunch of costly copy operations. Hopefully the reasoning in the code isn’t too hard to follow, but you can also consult Goldberg and Nivre’s papers if you’re confused and want to get to the bottom of this.

def get_gold_moves(n0, n, stack, heads, gold):
    def deps_between(target, others, gold):
        for word in others:
            if gold[word] == target or gold[target] == word:
                return True
        return False

    valid = get_valid_moves(n0, n, len(stack))
    if not stack or (SHIFT in valid and gold[n0] == stack[-1]):
        return [SHIFT]
    if gold[stack[-1]] == n0:
        return [LEFT]
    costly = set([m for m in MOVES if m not in valid])
    # If the word behind s0 is its gold head, Left is incorrect
    if len(stack) &amp;amp;gt;= 2 and gold[stack[-1]] == stack[-2]:
    # If there are any dependencies between n0 and the stack,
    # pushing n0 will lose them.
    if SHIFT not in costly and deps_between(n0, stack, gold):
    # If there are any dependencies between s0 and the buffer, popping
    # s0 will lose them.
    if deps_between(stack[-1], range(n0+1, n-1), gold):
    return [m for m in MOVES if m not in costly]

Doing this “dynamic oracle” training procedure makes a big difference to accuracy — typically 1-2%, with no difference to the way the run-time works. The old “static oracle” greedy training procedure is fully obsolete; there’s no reason to do it that way any more.


I have the sense that language technologies, particularly those relating to grammar, are particularly mysterious. I can imagine having no idea what the program might even do.

I think it therefore seems natural to people that the best solutions would be over-whelmingly complicated. A 200,000 line Java package feels appropriate.

But, algorithmic code is usually short, when only a single algorithm is implemented. And when you only implement one algorithm, and you know exactly what you want to write before you write a line, you also don’t pay for any unnecessary abstractions, which can have a big performance impact.


[1] I wasn’t really sure how to count the lines of code in the Stanford parser. Its jar file ships over 200k, but there are a lot of different models in it. It’s not important, but over 50k seems safe.

[2] For instance, how would you parse, “John’s school of music calls”? You want to make sure the phrase “John’s school” has a consistent structure in both “John’s school calls” and “John’s school of music calls”. Reasoning about the different “slots” you can put a phrase into is a key way we reason about what syntactic analyses look like. You can think of each phrase as having a different shaped connector, which you need to plug into different slots — which each phrase also has a certain number of, each of a different shape. We’re trying to figure out what connectors are where, so we can figure out how the sentences are put together.

[3] There’s an updated version of the Stanford parser that gets better accuracy, using a “deep learning” technique. But, the accuracy of the final model is still way behind the best shift-reduce parsers. It’s a great paper, and it doesn’t really matter that the idea was implemented on top of a parser that isn’t state-of-the-art. It seems very likely that the idea would still work on top of a shift-reduce parser, and I look forward to someone doing that.

[4] A point of detail: the Stanford dependencies are actually produced automatically given gold-standard phrase-structure trees. See the Stanford Dependency Converter page here:

Idle speculation

For a long time, incremental language processing algorithms were primarily of scientific interest. If you want to write a parser to test a theory about how the human sentence processor might work, well, that parser needs to build partial interpretations. There’s a wealth of evidence, including commonsense introspection, that establishes that we don’t buffer input and analyse it once the speaker has finished.

But now algorithms with that neat scientific feature are winning! As best as I can tell, the secret to that success is to be:

  • Incremental. Earlier words constrain the search.
  • Error-driven. Training involves a working hypothesis, which is updated as it makes mistakes.

The links to human sentence processing seem tantalising. I look forward to seeing whether these engineering breakthroughs lead to any psycholinguistic advances.


The NLP literature is almost entirely open access. All of the relavant papers can be found here:

The parser I’ve described is an implementation of the dynamic-oracle Arc-Hybrid system here:

Goldberg, Yoav; Nivre, Joakim
Training Deterministic Parsers with Non-Deterministic Oracles
TACL 2013

However, I wrote my own features for it. The arc-hybrid system was originally described here:

Kuhlmann, Marco; Gomez-Rodriguez, Carlos; Satta, Giorgio
Dynamic programming algorithms for transition-based dependency parsers
ACL 2011

The dynamic oracle training method was first described here:

A Dynamic Oracle for Arc-Eager Dependency Parsing
Goldberg, Yoav; Nivre, Joakim

This work depended on a big break-through in accuracy for transition-based parsers, when beam-search was properly explored by Zhang and Clark. They have several papers, but the preferred citation is:

Zhang, Yue; Clark, Steven
Syntactic Processing Using the Generalized Perceptron and Beam Search
Computational Linguistics 2011 (1)

Another important paper was this little feature engineering paper, which further improved the accuracy:

Zhang, Yue;  Nivre, Joakim
Transition-based Dependency Parsing with Rich Non-local Features
ACL 2011

The generalised perceptron, which is the learning framework for these beam parsers, is from this paper:

Collins, Michael
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
EMNLP 2002

Experimental details

The results at the start of the post refer to Section 22 of the Wall Street Journal corpus. The Stanford parser was run as follows:

java -mx10000m -cp &amp;amp;quot;$scriptdir/*:&amp;amp;quot; edu.stanford.nlp.parser.lexparser.LexicalizedParser \
-outputFormat &amp;amp;quot;penn&amp;amp;quot; edu/stanford/nlp/models/lexparser/englishFactored.ser.gz $*

A small post-process was applied, to undo the fancy tokenisation Stanford adds for numbers, to make them match the PTB tokenisation:

&amp;amp;quot;&amp;amp;quot;&amp;amp;quot;Stanford parser retokenises numbers. Split them.&amp;amp;quot;&amp;amp;quot;&amp;amp;quot;
import sys
import re

qp_re = re.compile('\xc2\xa0')
for line in sys.stdin:
    line = line.rstrip()
        line = line.replace('(CD', '(QP (CD', 1) + ')'
        line = line.replace('\xc2\xa0', ') (CD ')
    print line

The resulting PTB-format files were then converted into dependencies using the Stanford converter:

for f in $1/*.mrg; do
  echo $f
  grep -v CODE $f &amp;amp;amp;gt; &amp;amp;quot;$f.2&amp;amp;quot;
  java -mx800m -cp &amp;amp;quot;$scriptdir/*:&amp;amp;quot; edu.stanford.nlp.trees.EnglishGrammaticalStructure \
   -treeFile &amp;amp;quot;$f.2&amp;amp;quot; -basic -makeCopulaHead -conllx &amp;amp;amp;gt; $out

I can’t easily read that anymore, but it should just convert every .mrg file in a folder to a CoNLL-format Stanford basic dependencies file, using the settings common in the dependency literature.

I then converted the gold-standard trees from WSJ 22, for the evaluation. Accuracy scores refer to unlabelled attachment score (i.e. the head index) of all non-punctuation tokens.

To train, I fed the gold-standard PTB trees for WSJ 02-21 into the same conversion script.

In a nutshell: The Stanford model and are trained on the same set of sentences, and they each make their predictions on a held-out test set, for which we know the answers. Accuracy refers to how many of the words’ heads we got correct.

Speeds were measured on a 2.4Ghz Xeon. I ran the experiments on a server, to give the Stanford parser more memory. The system runs fine on my MacBook Air. I used PyPy for the experiments; CPython was about half as fast on an early benchmark.

One of the reasons is so fast is that it does unlabelled parsing. Based on previous experiments, a labelled parser would likely be about 40x slower, and about 1% more accurate. Adapting the program to labelled parsing would be a good exercise for the reader, if you have access to the data.

The result from the Redshift parser was produced from commit b6b624c9900f3bf, which was run as follows:

./scripts/ -x zhang+stack -k 8 -p ~/data/stanford/train.conll ~/data/parsers/tmp
./scripts/ ~/data/parsers/tmp ~/data/stanford/devi.txt /tmp/parse/
./scripts/ /tmp/parse/parses ~/data/stanford/dev.conll

2013/12/18 Posted by | Uncategorized | 21 Comments

A good POS tagger in about 200 lines of Python

Update: For a fast and accurate text-processing from Python, see

Up-to-date knowledge about natural language processing is mostly locked away in academia. And academics are mostly pretty self-conscious when we write. We’re careful. We don’t want to stick our necks out too much. But under-confident recommendations suck, so here’s how to write a good part-of-speech tagger.

There are a tonne of “best known techniques” for POS tagging, and you should ignore the others and just use Averaged Perceptron.

You should use two tags of history, and features derived from the Brown word clusters distributed here.

If you only need the tagger to work on carefully edited text, you should use case-sensitive features, but if you want a more robust tagger you should avoid them because they’ll make you over-fit to the conventions of your training domain. Instead, features that ask “how frequently is this word title-cased, in a large sample from the web?” work well. Then you can lower-case your comparatively tiny training corpus.

For efficiency, you should figure out which frequent words in your training data have unambiguous tags, so you don’t have to do anything but output their tags when they come up. About 50% of the words can be tagged that way.

And unless you really, really can’t do without an extra 0.1% of accuracy, you probably shouldn’t bother with any kind of search strategy — you should just use a greedy model.

If you do all that, you’ll find your tagger easy to write and understand, and an efficient Cython implementation will perform as follows on the standard evaluation, 130,000 words of text from the Wall Street Journal:

Tagger Accuracy Time (130k words)
CyGreedyAP 97.1% 4s

The 4s includes initialisation time — the actual per-token speed is high enough to be irrelevant; it won’t be your bottleneck.

It’s tempting to look at 97% accuracy and say something similar, but that’s not true. My parser is about 1% more accurate if the input has hand-labelled POS tags, and the taggers all perform much worse on out-of-domain data. Unfortunately accuracies have been fairly flat for the last ten years. That’s why my recommendation is to just use a simple and fast tagger that’s roughly as good.

The thing is though, it’s very common to see people using taggers that aren’t anywhere near that good!  For an example of what a non-expert is likely to use, these were the two taggers wrapped by TextBlob, a new Python api that I think is quite neat:

Tagger Accuracy Time (130k words)
NLTK 94.0% 3m56s
Pattern 93.5% 26s

Both Pattern and NLTK are very robust and beautifully well documented, so the appeal of using them is obvious. But Pattern’s algorithms are pretty crappy, and NLTK carries tremendous baggage around in its implementation because of its massive framework, and double-duty as a teaching tool.

As a stand-alone tagger, my Cython implementation is needlessly complicated — it was written for my parser. So today I wrote a 200 line version of my recommended algorithm for TextBlob. It gets:

Tagger Accuracy Time (130k words)
PyGreedyAP 96.8% 12s

I traded some accuracy and a lot of efficiency to keep the implementation simple. Here’s a far-too-brief description of how it works.

Averaged perceptron

POS tagging is a “supervised learning problem”. You’re given a table of data, and you’re told that the values in the last column will be missing during run-time. You have to find correlations from the other columns to predict that value.

So for us, the missing column will be “part of speech at word i“. The predictor columns (features) will be things like “part of speech at word i-1“, “last three letters of word at i+1“, etc

First, here’s what prediction looks like at run-time:

  def predict(self, features):
      '''Dot-product the features and current weights and return the best class.'''
      scores = defaultdict(float)
      for feat in features:
          if feat not in self.weights:
          weights = self.weights[feat]
          for clas, weight in weights.items():
              scores[clas] += weight
      # Do a secondary alphabetic sort, for stability
      return max(self.classes, key=lambda clas: (scores[clas], clas))

Earlier I described the learning problem as a table, with one of the columns marked as missing-at-runtime. For NLP, our tables are always exceedingly sparse. You have columns like “word i-1=Parliament”, which is almost always 0. So our “weight vectors” can pretty much never be implemented as vectors. Map-types are good though — here we use dictionaries.

The input data, features, is a set with a member for every non-zero “column” in our “table” — every active feature. Usually this is actually a dictionary, to let you set values for the features. But here all my features are binary present-or-absent type deals.

The weights data-structure is a dictionary of dictionaries, that ultimately associates feature/class pairs with some weight. You want to structure it this way instead of the reverse because of the way word frequencies are distributed: most words are rare, frequent words are very frequent.

Learning the weights

Okay, so how do we get the values for the weights? We start with an empty weights dictionary, and iteratively do the following:

  1. Receive a new (features, POS-tag) pair
  2. Guess the value of the POS tag given the current “weights” for the features
  3. If guess is wrong, add +1 to the weights associated with the correct class for these features, and -1 to the weights for the predicted class.

It’s one of the simplest learning algorithms. Whenever you make a mistake, increment the weights for the correct class, and penalise the weights that led to your false prediction. In code:

  def train(self, nr_iter, examples):
      for i in range(nr_iter):
          for features, true_tag in examples:
              guess = self.predict(features)
              if guess != true_tag:
                  for f in features:
                      self.weights[f][true_tag] += 1
                      self.weights[f][guess] -= 1

If you iterate over the same example this way, the weights for the correct class would have to come out ahead, and you’d get the example right. If you think about what happens with two examples, you should be able to see that it will get them both right unless the features are identical. In general the algorithm will converge so long as the examples are linearly separable, although that doesn’t matter for our purpose.

Averaging the weights

We need to do one more thing to make the perceptron algorithm competitive. The problem with the algorithm so far is that if you train it twice on slightly different sets of examples, you end up with really different models. It doesn’t generalise that smartly. And the problem is really in the later iterations — if you let it run to convergence, it’ll pay lots of attention to the few examples it’s getting wrong, and mutate its whole model around them.

So, what we’re going to do is make the weights more `sticky’ — give the model less chance to ruin all its hard work in the later rounds. And we’re going to do that by returning the averaged weights, not the final weights.

I doubt there are many people who are convinced that’s the most obvious solution to the problem, but whatever. We’re not here to innovate, and this way is time tested on lots of problems. If you have another idea, run the experiments and tell us what you find. Actually I’d love to see more work on this, now that the averaged perceptron has become such a prominent learning algorithm in NLP.

Okay. So this averaging. How’s that going to work? Note that we don’t want to just average after each outer-loop iteration. We want the average of all the values — from the inner loop. So if we have 5,000 examples, and we train for 10 iterations, we’ll average across 50,000 values for each weight.

Obviously we’re not going to store all those intermediate values. Instead, we’ll track an accumulator for each weight, and divide it by the number of iterations at the end. Again: we want the average weight assigned to a feature/class pair during learning, so the key component we need is the total weight it was assigned. But we also want to be careful about how we compute that accumulator, too. On almost any instance, we’re going to see a tiny fraction of active feature/class pairs. All the other feature/class weights won’t change. So we shouldn’t have to go back and add the unchanged value to our accumulators anyway, like chumps.

Since we’re not chumps, we’ll make the obvious improvement. We’ll maintain another dictionary that tracks how long each weight has gone unchanged. Now when we do change a weight, we can do a fast-forwarded update to the accumulator, for all those iterations where it lay unchanged.

Here’s what a weight update looks like now that we have to maintain the totals and the time-stamps:

  def update(self, truth, guess, features):
      def upd_feat(c, f, v):
          nr_iters_at_this_weight = self.i - self._timestamps[f][c]
          self._totals[f][c] += nr_iters_at_this_weight * self.weights[f][c]
          self.weights[f][c] += v
          self._timestamps[f][c] = self.i

      self.i += 1
      for f in features:
          upd_feat(truth, f, 1.0)
          upd_feat(guess, f, -1.0)

Features and pre-processing

The POS tagging literature has tonnes of intricate features sensitive to case, punctuation, etc. They help on the standard test-set, which is from Wall Street Journal articles from the 1980s, but I don’t see how they’ll help us learn models that are useful on other text.

To help us learn a more general model, we’ll pre-process the data prior to feature extraction, as follows:

  • All words are lower cased;
  • Digits in the range 1800-2100 are represented as !YEAR;
  • Other digit strings are represented as !DIGITS

It would be better to have a module recognising dates, phone numbers, emails, hash-tags, etc. but that will have to be pushed back into the tokenization.

I played around with the features a little, and this seems to be a reasonable bang-for-buck configuration in terms of getting the development-data accuracy to 97% (where it typically converges anyway), and having a smaller memory foot-print:

    def _get_features(self, i, word, context, prev, prev2):
        '''Map tokens-in-contexts into a feature representation, implemented as a
        set. If the features change, a new model must be trained.'''
        def add(name, *args):
            features.add('+'.join((name,) + tuple(args)))

        features = set()
        add('bias') # This acts sort of like a prior
        add('i suffix', word[-3:])
        add('i pref1', word[0])
        add('i-1 tag', prev)
        add('i-2 tag', prev2)
        add('i tag+i-2 tag', prev, prev2)
        add('i word', context[i])
        add('i-1 tag+i word', prev, context[i])
        add('i-1 word', context[i-1])
        add('i-1 suffix', context[i-1][-3:])
        add('i-2 word', context[i-2])
        add('i+1 word', context[i+1])
        add('i+1 suffix', context[i+1][-3:])
        add('i+2 word', context[i+2])
        return features

I haven’t added any features from external data, such as case frequency statistics from the Google Web 1T corpus. I might add those later, but for now I figured I’d keep things simple.

What about search?

The model I’ve recommended commits to its predictions on each word, and moves on to the next one. Those predictions are then used as features for the next word. There’s a potential problem here, but it turns out it doesn’t matter much. It’s easy to fix with beam-search, but I say it’s not really worth bothering. And it definitely doesn’t matter enough to adopt a slow and complicated algorithm like Conditional Random Fields.

Here’s the problem. The best indicator for the tag at position, say, 3 in a sentence is the word at position 3. But the next-best indicators are the tags at positions 2 and 4. So there’s a chicken-and-egg problem: we want the predictions for the surrounding words in hand before we commit to a prediction for the current word. Here’s an example where search might matter:

Their management plan reforms worked

Depending on just what you’ve learned from your training data, you can imagine making a different decision if you started at the left and moved right, conditioning on your previous decisions, than if you’d started at the right and moved left.

If that’s not obvious to you, think about it this way: “worked” is almost surely a verb, so if you tag “reforms” with that in hand, you’ll have a different idea of its tag than if you’d just come from “plan“, which you might have regarded as either a noun or a verb.

Search can only help you when you make a mistake. It can prevent that error from throwing off your subsequent decisions, or sometimes your future choices will correct the mistake. And that’s why for POS tagging, search hardly matters! Your model is so good straight-up that your past predictions are almost always true. So you really need the planets to align for search to matter at all.

And as we improve our taggers, search will matter less and less. Instead of search, what we should be caring about is multi-tagging. If we let the model be a bit uncertain, we can get over 99% accuracy assigning an average of 1.05 tags per word (Vadas et al, ACL 2006). The averaged perceptron is rubbish at multi-tagging though. That’s its big weakness. You really want a probability distribution for that.

One caveat when doing greedy search, though. It’s very important that your training data model the fact that the history will be imperfect at run-time. Otherwise, it will be way over-reliant on the tag-history features. Because the Perceptron is iterative, this is very easy.

Here’s the training loop for the tagger:

    def train(self, sentences, save_loc=None, nr_iter=5, quiet=False):
        '''Train a model from sentences, and save it at save_loc. nr_iter
        controls the number of Perceptron training iterations.'''
        self._make_tagdict(sentences, quiet=quiet)
        self.model.classes = self.classes
        prev, prev2 = START
        for iter_ in range(nr_iter):
            c = 0; n = 0
            for words, tags in sentences:
                context = START + [self._normalize(w) for w in words] + END
                for i, word in enumerate(words):
                    guess = self.tagdict.get(word)
                    if not guess:
                        feats = self._get_features(i, word, context, prev, prev2)
                        guess = self.model.predict(feats)
                        self.model.update(tags[i], guess, feats)
                    # Set the history features from the guesses, not the true tags
                    prev2 = prev; prev = guess
                    c += guess == tags[i]; n += 1
            if not quiet:
                print(&quot;Iter %d: %d/%d=%.3f&quot; % (iter_, c, n, _pc(c, n)))
        # Pickle as a binary file
        if save_loc is not None:
            cPickle.dump((self.model.weights, self.tagdict, self.classes),
                         open(save_loc, 'wb'), -1)

Unlike the previous snippets, this one’s literal — I tended to edit the previous ones to simplify. So if they have bugs, hopefully that’s why!

At the time of writing, I’m just finishing up the implementation before I submit a pull request to TextBlob. You can see the rest of the source here:

A final comparison…

Over the years I’ve seen a lot of cynicism about the WSJ evaluation methodology. The claim is that we’ve just been meticulously over-fitting our methods to this data. Actually the evidence doesn’t really bear this out. Mostly, if a technique is clearly better on one evaluation, it improves others as well. Still, it’s very reasonable to want to know how these tools perform on other text. So I ran the unchanged models over two other sections from the OntoNotes corpus:

Tagger WSJ ABC Web
Pattern 93.5 90.7 88.1
NLTK 94.0 91.5 88.4
PyGreedyAP 96.8 94.8 91.8

The ABC section is broadcast news, Web is text from the web (blogs etc — I haven’t looked at the data much).

As you can see, the order of the systems is stable across the three comparisons, and the advantage of our Averaged Perceptron tagger over the other two is real enough. Actually the pattern tagger does very poorly on out-of-domain text. It mostly just looks up the words, so it’s very domain dependent. I hadn’t realised it before, but it’s obvious enough now that I think about it.

We can improve our score greatly by training on some of the foreign data. The technique described in this paper (Daume III, 2007) is the first thing I try when I have to do that.

P.S. The format of these tables sucks. Probably I should move off to actual hosting, but in the meantime, any suggestions?

2013/09/11 Posted by | Uncategorized | 45 Comments

A Simple Extractive Summarisation System

This post talks you through the code of a simple extractive summarisation system written in Python. There was a bit of interest in a summariser posted to reddit a few days ago, which inspired me to do some tinkering. I’ve never really engaged with the extensive summarisation literature, and I doubt I’ll be publishing in it anytime soon. I mostly built this for fun, and to teach myself a little about publishing an application with django and Google App Engine.

The system can be found at (at least while I’m within my free quota). The version at app engine is slightly different from the one described below, as I’ve added a few optimisations to fit within the memory limit and make my CPU time stretch a little further.

You can find a version of the GAE ready code here, under the FreeBSD license.


We can divide the system into three basic components:

  1. Text pre-processor. This handles things like dividing the document into sentences, dividing the sentences into words, normalising away suffixes, eliminating common words, etc.
  2. The sentence selector. This is the engine of the system. We will base our selection criteria on word frequencies. This is a common starting point for many NLP tasks.
  3. Sentence sorter. This sorts the selected sentences to produce a pleasing summary. We will simply order the sentences by their position in the document.

The Summariser Class

class Summariser(object):
    def __call__(self, doc, n):
        Return the N most pertinent sentences of a document,
        in the order in which they occur
        # A 'bag-of-words' is a sparse vector
        # of token frequencies.
        docBow = doc.bagOfWords
        # Get N most similar.
        topN = sorted(doc, key=lambda s:
            self.similarity(docBow, s.bagOfWords))[-n:]
        # Sort back into document order.
        return sorted(topN, key=lambda s: s.position)

    def similarity(self, docBow, sentBow):
        # Sum the frequencies of each token in the sentence
        # in the rest of the document
        return sum((docBow[t]-f)*f for t, f in sentBow.items())

The Summariser.__call__ function takes the processed document and performs two sorts, trimming the document down to N sentences after the first sort by similarity. The similarity score for each sentence is defined as the sum of the frequency of its words in other sentences of the document. We exclude the sentence itself, so that we avoid selecting long sentences that contain terms not in the rest of the document. The idea here is to find sentences that are prototypical of the document — that is, the most like the rest of the document, in terms of their word frequency distribution.

Pre-processing with nltk

The summariser expects the document to have the text divided into sentences, the sentences divided into words, and word frequencies counted within both. To do this, we call some functions from nltk The pre-processing stage takes the document from a raw string to a list of sentences, each of which contains a list of tokens. We will also index the tokens in the document into a “bag-of-words”: a sparse vector that stores each word’s frequency. This is implemented as a defaultdictionary. The Document and Sentence classes look like this:

from collections import defaultdict
from nltk.tokenize import sent_tokenize as sent_tokenise
from nltk.tokenize import word_tokenize as word_tokenise

class Document(list):
    def __init__(self, **kwargs):
        Build a list of sentences and a bag of words
        if 'text' in kwargs:
            text = kwargs.pop('text')
        elif 'url' in kwargs:
            text = self._urlToText(kwargs.pop('url'))
            raise StandardError, "Document requires text or url"
        assert not kwargs
        bow = defaultdict(int)
        for i, sentenceStr in enumerate(sent_tokenise(text)):
            sentence = Sentence(sentenceStr, i)
            for k, v in sentence.bagOfWords.items():
                bow[k] += v
        self.bagOfWords = bow

class Sentence(list):
    def __init__(self, sentenceStr, position):
        self.string = cgi.escape(sentenceStr)
        # Lowercase the first word
        if sentenceStr[0].isupper():
            letters = list(sentenceStr)
            letters[0] = letters[0].lower()
            sentenceStr = ''.join(letters)
        tokens = word_tokenise(sentenceStr)
        bow = defaultdict(int)
        for token in tokens:
            # Normalise for morphology and case,
            # and exclude common words
            term = standardise(token)
            if term:
                bow[term] += 1
        self.bagOfWords = bow
        self.position = position

All the real work here is done by the nltk Punkt sentence boundary detector, and the tokeniser.  These tasks aren’t that interesting, but they’re surprisingly hard to get just right. Sentence boundary detection is a good example of this: good solutions are much more complicated than you might think. I would advise against rolling your own sentence boundary detector with a regular expression, for the same reason regular expressions are a poor choice for parsing html: it’s an error-prone strategy that’s fundamentally not powerful enough for the task. Use the good solutions that work right out of the box. Speaking of html parsing, the process that extracts text from web pages is, in my opinion, the weakest part of the system. I’d really appreciate feedback on a better way to do this. The function looks like this:

whiteRE = re.compile(r'\s+')
    constraintsRE = re.compile(u'^[^\u2022]|[.!?]$')
    def _urlToText(self, url):
        (Terrible) text extraction using
        that ugly swamp BeautifulSoup
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        body = soup.findAll('body')
        body = body[0] if body else soup
        parElements = [p for p in body('p') if not p.attrs]
        brPars = [p for p in body(True) \
                  if p('br', recursive=False)]
        for node in brPars:
            for par in node.fetchText(True, recursive=False):
        paragraphs = []
        for paragraph in parElements:
            if isinstance(paragraph, unicode):
                text = paragraph
                text = ' '.join(paragraph.fetchText(True))
            text = self.whiteRE.sub(' ', text.strip())
        title = soup.find('title').string
        return title, '\n'.join(paragraphs)

This method looks for text under p elements, and text that is the sibling of br elements. A few ad hoc constraints were added from manual testing. This is the part of the program I’m least happy with, and I’d love to hear about a more principled solution to finding the body text of an arbitrary page.

I’m also quite unhappy with the implementation. Initially I wrote this with pyquery instead of BeautifulSoup, but Google App Engine does not support lxml. Boo and hiss by starring this issue. I really don’t like BeautifulSoup. It’s slow, and I find the maze of aliases in the interface incredibly annoying.

Addressing Zipf’s Law

The pre-processing and summarisation code presented so far is enough to do the job. However, we can improve the accuracy of the summariser by writing in just a touch more linguistic sensitivity.  Most words in a language are rare, which can be problematic when your application works by counting word frequencies. To be more precise, word frequencies follow a power law distribution, an observation known as Zipf’s law. This means that our frequency distributions will generally be quite sparse, making lexical frequencies less meaningful. There are a few standard tactics we can deploy to address this.

Standardising Word Forms

The first set of tweaks are stemming, stopping and case normalisation — processes I collectively refer to as “standardisation”. Stemming is the removal of inflexional and derivational suffixes. For instance, both archeologist and archeology would get stemmed to archeolog. This might look weird, but it means the terms will match, allowing us to count them together as one term. Stemming loses some precision, but on the whole it’s often beneficial when you’re trying to model the document’s topic. I used the nltk implementation of Porter‘s algorithm. The (ugly) source for this can be viewed here. Stopping addresses the other side of the problem caused by the unbalanced frequency distribution of words. Grammatical terms, such as “the” and “of”, are far more frequent than the interesting content words. It is therefore useful to discard them from our counts. I do this with the stop list provided with nltk. Finally, another possibility is to normalise away case variation. I’ve chosen not to do this, because I don’t want to confuse proper nouns with common nouns. Instead I just lower case the first word of the sentence. This will be wrong for sentences that start with proper nouns, but I didn’t want to use a more complicated process, such as part-of-speech tagging. I implement the standardise processes as a private class inside a module, with only an instance made public.. This centralises them, so that I can be sure all tokens are being standardised in the same way.

import _stops
import nltk.stem.porter as _porter

class _Standardise(object):
    def __init__(self):
        self.stemmer = lambda word: word
        self.stopWords = {}
        self.lower = False

    def __call__(self, string):
        if string.lower() in self.stopWords:
            return None
        if self.lower:
            string = string.lower()
        stemmed = self.stemmer(string)
        return stemmed

    def config(self, **kwargs):
        stemmer = kwargs.pop('stemming')
        if stemmer == 'porter':
            self.stemmer = _porter.PorterStemmer().stem_word
            raise StandardError, "Unknown stemmer", stemmer
        if kwargs.pop('stopping'):
            self.stopWords = _stops.nltkStops
        self.lower = bool(kwargs.pop('lower'))

standardise = _Standardise()

TF-IDF Weighting

There’s actually a way that we can use the imbalanced frequency distribution of word occurrences to our advantage, to discover what a document is really about more reliably. Imagine we have a document where two words occur 5 times each: professional and archeologist. It seems reasonable to assume that archeologist is a much stronger clue about the document’s topic, and we should weight sentences that contain this word higher so that hopefully they make it into the summary.

The reason archeologist is a better indication is that we are more surprised to see it. What we want, then, are statistics about how often a word appears in the language as a whole. Word which are rare overall will be weighted highly; words which are common will receive a low weight. This is called TF-IDF weighting, for term frequency*inverse document frequency. The weight is calculated by taking the log of the inverse percentage of documents in a sample that the word has occurred in. I used a sample of 800,000 pages from the English Wikipedia to determine the document frequencies. The log is taken to avoid having the IDF part dominate the term frequency part. Consider that we would be weighting single occurrence words by 800,000 if we used a linear weight.

Apart from some boring code to read in the document frequencies and turn them into IDF weights, the TF-IDF weighted summariser simply overrides the similarity method of the Summariser:

class TFIDFSummariser(Summariser):
    Scores sentences by TF-IDF weighted token frequencies
    def __init__(self):
        # The number of documents the frequencies were drawn from.
        n = 818741.0
        self.idfs = self._loadIDFs(n)

    def _loadIDFs(self, n):
        dfLoc = localPath('wiki_doc_freqs.txt')
        dfs = collections.defaultdict(int)
        # Convenience for
        lines = utf8open(dfLoc).read().strip().split('\n')
        # Read in the document freqs.
        # Have to do this first because we collapse some freqs
        # through standardisation.
        for line in lines:
            token, freq = line.split('\t')
            token = standardise(token)
            if token:
                dfs[token] += int(freq)
        # Turn the frequencies into IDF weights.
        idfs = collections.defaultdict(float)
        for token, freq in dfs.items():
            idf = log(n/freq, 10)
            idfs[token] = idf
        return idfs

    def similarity(self, docBow, sentBow):
        idfs = self.idfs
        # Apply the IDF weight to each term.
        return sum([f*(docBow[t]-f)*idfs[t] for t, f in sentBow.items()])

This is a little less efficient than applying the IDF weights to the document vector directly, since we may have to weigh the same term multiple times for different sentences. However, I doubt this will make much difference to overall running time, and it allows a design I think is a little nicer, as the variation is confined to the similarity method.

That’s it for now…

Well, that’s the simple extractive summariser, and a tweaked one with TF-IDF. If there’s sufficient interest, I’ll follow this up with a little evaluation experiment.

Summarisation is difficult to evaluate, because quality is subjective, but I have a plan for what seems to me to be a reasonable approximation.

2009/11/18 Posted by | Uncategorized | 8 Comments