Computational Linguistics

Demystifying NLP

Parsing English with 500 lines of Python

A syntactic parser describes a sentence’s grammatical structure, to help another application reason about it. Natural languages introduce many unexpected ambiguities, which our world-knowledge immediately filters out. A favourite example:

They ate the pizza with anchovies

A correct parse links “with” to “pizza”, while an incorrect parse links “with” to “eat”:

The Natural Language Processing (NLP) community has made big progress in syntactic parsing over the last few years. It’s now possible for a tiny Python implementation to perform better than the widely-used Stanford PCFG parser:

Update! The Stanford CoreNLP library now includes a greedy transition-based dependency parser, similar to the one described in this post, but with an improved learning strategy. It is much faster and more accurate than this simple Python implementation.

Parser Accuracy Speed (w/s) Language LOC
Stanford 89.6% 19 Java > 50,000[1] 89.8% 2,020 Python ~500
Redshift 93.6% 2,580 Cython ~4,000

The rest of the post sets up the problem, and then takes you through a concise implementation, prepared for this post. The first 200 lines of, the part-of-speech tagger and learner, are described here. You should probably at least skim that post before reading this one, unless you’re very familiar with NLP research.

The Cython system, Redshift, was written for my current research. I plan to improve it for general use in June, after my contract ends at Macquarie University. The current version is hosted on GitHub.

Problem Description

It’d be nice to type an instruction like this into your phone:

Set volume to zero when I’m in a meeting, unless John’s school calls.

And have it set the appropriate policy. On Android you can do this sort of thing with Tasker, but an NL interface would be much better. It’d be especially nice to receive a meaning representation you could edit, so you could see what it thinks you said, and correct it.

There are lots of problems to solve to make that work, but some sort of syntactic representation is definitely necessary. We need to know that:

Unless John’s school calls, when I’m in a meeting, set volume to zero

is another way of phrasing the first instruction, while:

Unless John’s school, call when I’m in a meeting

means something completely different.

A dependency parser returns a graph of word-word relationships, intended to make such reasoning easier. Our graphs will be trees — edges will be directed, and every node (word) will have exactly one incoming arc (one dependency, with its head), except one.

Example usage:

    >>> parser = parser.Parser()
    >>> tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
    >>> tags, heads = parser.parse(tokens)
    >>> heads
    [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
    >>> for i, h in enumerate(heads): 
    ...   head = tokens[heads[h]] if h >= 1 else 'None'
    ...   print(tokens[i] + ' <-- ' + head])
    Set <-- None
    the <-- volume
    volume <-- Set
    to <-- Set
    zero <-- to
    when <-- Set
    I <-- 'm
    'm <-- when
    in <-- 'm
    a <-- meeting
    meeting <-- in
    unless <-- Set
    John <-- 's
    's   <-- calls
    school <-- calls
    calls <-- unless

The idea is that it should be slightly easier to reason from the parse, than it was from the string. The parse-to-meaning mapping is hopefully simpler than the string-to-meaning mapping.

The most confusing thing about this problem area is that “correctness” is defined by convention — by annotation guidelines. If you haven’t read the guidelines and you’re not a linguist, you can’t tell whether the parse is “wrong” or “right”, which makes the whole task feel weird and artificial.

For instance, there’s a mistake in the parse above: “John’s school calls” is structured wrongly, according to the Stanford annotation guidelines. The structure of that part of the sentence is how the annotators were instructed to parse an example like “John’s school clothes”.

It’s worth dwelling on this point a bit. We could, in theory, have written our guidelines so that the “correct” parses were reversed. There’s good reason to believe the parsing task will be harder if we reversed our convention, as it’d be less consistent with the rest of the grammar.[2] But we could test that empirically, and we’d be pleased to gain an advantage by reversing the policy.

We definitely do want that distinction in the guidelines — we don’t want both to receive the same structure, or our output will be less useful. The annotation guidelines strike a balance between what distinctions downstream applications will find useful, and what parsers will be able to predict easily.

Projective trees

There’s a particularly useful simplification that we can make, when deciding what we want the graph to look like: we can restrict the graph structures we’ll be dealing with. This doesn’t just give us a likely advantage in learnability; it can have deep algorithmic implications. We follow most work on English in constraining the dependency graphs to be projective trees:

  1. Tree. Every word has exactly one head, except for the dummy ROOT symbol.
  2. Projective. For every pair of dependencies (a1, a2) and (b1, b2), if a1 < b2, then a2 >= b2. In other words, dependencies cannot “cross”. You can’t have a pair of dependencies that goes a1 b1 a2 b2, or b1 a1 b2 a2.

There’s a rich literature on parsing non-projective trees, and a smaller literature on parsing DAGs. But the parsing algorithm I’ll be explaining deals with projective trees.

Greedy transition-based parsing

Our parser takes as input a list of string tokens, and outputs a list of head indices, representing edges in the graph. If the ith member of heads is j, the dependency parse contains an edge (j, i). A transition-based parser is a finite-state transducer; it maps an array of N words onto an output array of N head indices:

start MSNBC reported that Facebook bought WhatsApp for $16bn root
0 2 9 2 4 2 4 4 7 0

The heads array denotes that the head of MSNBC is reported: MSNBC is word 1, and reported is word 2, and heads[1] == 2. You can already see why parsing a tree is handy — this data structure wouldn’t work if we had to output a DAG, where words may have multiple heads.

Although heads can be represented as an array, we’d actually like to maintain some alternate ways to access the parse, to make it easy and efficient to extract features. Our Parse class looks like this:

    class Parse(object):
        def __init__(self, n):
            self.n = n
            self.heads = [None] * (n-1)
            self.lefts = []
            self.rights = []
            for i in range(n+1):

        def add_arc(self, head, child):
            self.heads[child] = head
            if child &amp;amp;lt; head:

As well as the parse, we also have to keep track of where we’re up to in the sentence. We’ll do this with an index into the words array, and a stack, to which we’ll push words, before popping them once their head is set. So our state data structure is fundamentally:

  • An index, i, into the list of tokens;
  • The dependencies added so far, in Parse
  • A stack, containing words that occurred before i, for which we’re yet to assign a head.

Each step of the parsing process applies one of three actions to the state:

    SHIFT = 0; RIGHT = 1; LEFT = 2

    def transition(move, i, stack, parse):
        global SHIFT, RIGHT, LEFT
        if move == SHIFT:
            return i + 1
        elif move == RIGHT:
            parse.add_arc(stack[-2], stack.pop())
            return i
        elif move == LEFT:
            parse.add_arc(i, stack.pop())
            return i
        raise GrammarError("Unknown move: %d" % move)

The LEFT and RIGHT actions add dependencies and pop the stack, while SHIFT pushes the stack and advances i into the buffer.

So, the parser starts with an empty stack, and a buffer index at 0, with no dependencies recorded. It chooses one of the (valid) actions, and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input. (It’s hard to understand this sort of algorithm without stepping through it. Try coming up with a sentence, drawing a projective parse tree over it, and then try to reach the parse tree by choosing the right sequence of transitions.)

Here’s what the parsing loop looks like in code:

    class Parser(object):
        def parse(self, words):
            tags = self.tagger(words)
            n = len(words)
            idx = 1
            stack = [0]
            deps = Parse(n)
            while stack or idx &amp;amp;lt; n:
                features = extract_features(words, tags, idx, n, stack, deps)
                scores = self.model.score(features)
                valid_moves = get_valid_moves(i, n, len(stack))
                next_move = max(valid_moves, key=lambda move: scores[move])
                idx = transition(next_move, idx, stack, parse)
            return tags, parse

    def get_valid_moves(i, n, stack_depth):
        moves = []
        if i &amp;amp;lt; n:
        if stack_depth <= 2:
        if stack_depth <= 1:
        return moves

We start by tagging the sentence, and initializing the state. We then map the state to a set of features, which we score using a linear model. We then find the best-scoring valid move, and apply it to the state.

The model scoring works the same as it did in the POS tagger. If you’re confused about the idea of extracting features and scoring them with a linear model, you should review that post. Here’s a reminder of how the model scoring works:

    class Perceptron(object)
        def score(self, features):
            all_weights = self.weights
            scores = dict((clas, 0) for clas in self.classes)
            for feat, value in features.items():
                if value == 0:
                if feat not in all_weights:
                weights = all_weights[feat]
                for clas, weight in weights.items():
                    scores[clas] += value * weight
            return scores

It’s just summing the class-weights for each feature. This is often expressed as a dot-product, but when you’re dealing with multiple classes, that gets awkward, I find.

The beam parser (RedShift) tracks multiple candidates, and only decides on the best one at the very end. We’re going to trade away accuracy in favour of efficiency and simplicity. We’ll only follow a single analysis. Our search strategy will be entirely greedy, as it was with the POS tagger. We’ll lock-in our choices at every step.

If you read the POS tagger post carefully, you might see the underlying similarity. What we’ve done is mapped the parsing problem onto a sequence-labelling problem, which we address using a “flat”, or unstructured, learning algorithm (by doing greedy search).


Feature extraction code is always pretty ugly. The features for the parser refer to a few tokens from the context::

  • The first three words of the buffer (n0, n1, n2)
  • The top three words of the stack (s0, s1, s2)
  • The two leftmost children of s0 (s0b1, s0b2);
  • The two rightmost children of s0 (s0f1, s0f2);
  • The two leftmost children of n0 (n0b1, n0b2)

For these 12 tokens, we refer to the word-form, the part-of-speech tag, and the number of left and right children attached to the token.

Because we’re using a linear model, we have our features refer to pairs and triples of these atomic properties.

def extract_features(words, tags, n0, n, stack, parse):
    def get_stack_context(depth, stack, data):
        if depth >= 3:
            return data[stack[-1]], data[stack[-2]], data[stack[-3]]
        elif depth >= 2:
            return data[stack[-1]], data[stack[-2]], ''
        elif depth == 1:
            return data[stack[-1]], '', ''
            return '', '', ''

    def get_buffer_context(i, n, data):
        if i + 1 >= n:
            return data[i], '', ''
        elif i + 2 >= n:
            return data[i], data[i + 1], ''
            return data[i], data[i + 1], data[i + 2]

    def get_parse_context(word, deps, data):
        if word == -1:
            return 0, '', ''
        deps = deps[word]
        valency = len(deps)
        if not valency:
            return 0, '', ''
        elif valency == 1:
            return 1, data[deps[-1]], ''
            return valency, data[deps[-1]], data[deps[-2]]

    features = {}
    # Set up the context pieces --- the word, W, and tag, T, of:
    # S0-2: Top three words on the stack
    # N0-2: First three words of the buffer
    # n0b1, n0b2: Two leftmost children of the first word of the buffer
    # s0b1, s0b2: Two leftmost children of the top word of the stack
    # s0f1, s0f2: Two rightmost children of the top word of the stack

    depth = len(stack)
    s0 = stack[-1] if depth else -1

    Ws0, Ws1, Ws2 = get_stack_context(depth, stack, words)
    Ts0, Ts1, Ts2 = get_stack_context(depth, stack, tags)

    Wn0, Wn1, Wn2 = get_buffer_context(n0, n, words)
    Tn0, Tn1, Tn2 = get_buffer_context(n0, n, tags)

    Vn0b, Wn0b1, Wn0b2 = get_parse_context(n0, parse.lefts, words)
    Vn0b, Tn0b1, Tn0b2 = get_parse_context(n0, parse.lefts, tags)

    Vn0f, Wn0f1, Wn0f2 = get_parse_context(n0, parse.rights, words)
    _, Tn0f1, Tn0f2 = get_parse_context(n0, parse.rights, tags)

    Vs0b, Ws0b1, Ws0b2 = get_parse_context(s0, parse.lefts, words)
    _, Ts0b1, Ts0b2 = get_parse_context(s0, parse.lefts, tags)

    Vs0f, Ws0f1, Ws0f2 = get_parse_context(s0, parse.rights, words)
    _, Ts0f1, Ts0f2 = get_parse_context(s0, parse.rights, tags)

    # Cap numeric features at 5? 
    # String-distance
    Ds0n0 = min((n0 - s0, 5)) if s0 != 0 else 0

    features['bias'] = 1
    # Add word and tag unigrams
    for w in (Wn0, Wn1, Wn2, Ws0, Ws1, Ws2, Wn0b1, Wn0b2, Ws0b1, Ws0b2, Ws0f1, Ws0f2):
        if w:
            features['w=%s' % w] = 1
    for t in (Tn0, Tn1, Tn2, Ts0, Ts1, Ts2, Tn0b1, Tn0b2, Ts0b1, Ts0b2, Ts0f1, Ts0f2):
        if t:
            features['t=%s' % t] = 1

    # Add word/tag pairs
    for i, (w, t) in enumerate(((Wn0, Tn0), (Wn1, Tn1), (Wn2, Tn2), (Ws0, Ts0))):
        if w or t:
            features['%d w=%s, t=%s' % (i, w, t)] = 1

    # Add some bigrams
    features['s0w=%s,  n0w=%s' % (Ws0, Wn0)] = 1
    features['wn0tn0-ws0 %s/%s %s' % (Wn0, Tn0, Ws0)] = 1
    features['wn0tn0-ts0 %s/%s %s' % (Wn0, Tn0, Ts0)] = 1
    features['ws0ts0-wn0 %s/%s %s' % (Ws0, Ts0, Wn0)] = 1
    features['ws0-ts0 tn0 %s/%s %s' % (Ws0, Ts0, Tn0)] = 1
    features['wt-wt %s/%s %s/%s' % (Ws0, Ts0, Wn0, Tn0)] = 1
    features['tt s0=%s n0=%s' % (Ts0, Tn0)] = 1
    features['tt n0=%s n1=%s' % (Tn0, Tn1)] = 1

    # Add some tag trigrams
    trigrams = ((Tn0, Tn1, Tn2), (Ts0, Tn0, Tn1), (Ts0, Ts1, Tn0), 
                (Ts0, Ts0f1, Tn0), (Ts0, Ts0f1, Tn0), (Ts0, Tn0, Tn0b1),
                (Ts0, Ts0b1, Ts0b2), (Ts0, Ts0f1, Ts0f2), (Tn0, Tn0b1, Tn0b2),
                (Ts0, Ts1, Ts1))
    for i, (t1, t2, t3) in enumerate(trigrams):
        if t1 or t2 or t3:
            features['ttt-%d %s %s %s' % (i, t1, t2, t3)] = 1

    # Add some valency and distance features
    vw = ((Ws0, Vs0f), (Ws0, Vs0b), (Wn0, Vn0b))
    vt = ((Ts0, Vs0f), (Ts0, Vs0b), (Tn0, Vn0b))
    d = ((Ws0, Ds0n0), (Wn0, Ds0n0), (Ts0, Ds0n0), (Tn0, Ds0n0),
        ('t' + Tn0+Ts0, Ds0n0), ('w' + Wn0+Ws0, Ds0n0))
    for i, (w_t, v_d) in enumerate(vw + vt + d):
        if w_t or v_d:
            features['val/d-%d %s %d' % (i, w_t, v_d)] = 1
    return features


Weights are learned using the same algorithm, averaged perceptron, that we used for part-of-speech tagging. Its key strength is that it’s an online learning algorithm: examples stream in one-by-one, we make our prediction, check the actual answer, and adjust our beliefs (weights) if we were wrong.

The training loop looks like this:

class Parser(object):
    def train_one(self, itn, words, gold_tags, gold_heads):
        n = len(words)
        i = 2; stack = [1]; parse = Parse(n)
        tags = self.tagger.tag(words)
        while stack or (i + 1) &amp;amp;lt; n:
            features = extract_features(words, tags, i, n, stack, parse)
            scores = self.model.score(features)
            valid_moves = get_valid_moves(i, n, len(stack))
            guess = max(valid_moves, key=lambda move: scores[move])
            gold_moves = get_gold_moves(i, n, stack, parse.heads, gold_heads)
            best = max(gold_moves, key=lambda move: scores[move])
        self.model.update(best, guess, features)
        i = transition(guess, i, stack, parse)
    # Return number correct
    return len([i for i in range(n-1) if parse.heads[i] == gold_heads[i]])

The most interesting part of the training process is in get_gold_moves. The performance of our parser is made possible by an advance by Goldberg and Nivre (2012), who showed that we’d been doing this wrong for years.

In the POS-tagging post, I cautioned that during training you need to make sure you pass in the last two predicted tags as features for the current tag, not the last two gold tags. At test time you’ll only have the predicted tags, so if you base your features on the gold sequence during training, your training contexts won’t resemble your test-time contexts, so you’ll learn the wrong weights.

In parsing, the problem was that we didn’t know how to pass in the predicted sequence! Training worked by taking the gold-standard tree, and finding a transition sequence that led to it. i.e., you got back a sequence of moves, with the guarantee that if you followed those moves, you’d get the gold-standard dependencies.

The problem is, we didn’t know how to define the “correct” move to teach a parser to make if it was in any state that wasn’t along that gold-standard sequence. Once the parser had made a mistake, we didn’t know how to train from that example.

That was a big problem, because it meant that once the parser started making mistakes, it would end up in states unlike any in its training data — leading to yet more mistakes.
The problem was specific to greedy parsers: once you use a beam, there’s a natural way to do structured prediction.

The solution seems obvious once you know it, like all the best breakthroughs. What we do is define a function that asks “How many gold-standard dependencies can be recovered from this state?”. If you can define that function, then you can apply each move in turn, and ask, “How many gold-standard dependencies can be recovered from this state?”. If the action you applied allows fewer gold-standard dependencies to be reached, then it is sub-optimal.

That’s a lot to take in.

So we have this function Oracle(state):

Oracle(state) = | gold_arcs ∩ reachable_arcs(state) |

We also have a set of actions, each of which returns a new state. We want to know:

  • shift_cost = Oracle(state) – Oracle(shift(state))
  • right_cost = Oracle(state) – Oracle(right(state))
  • left_cost = Oracle(state) – Oracle(left(state))

Now, at least one of those costs has to be zero. Oracle(state) is asking, “what’s the cost of the best path forward?”, and the first action of that best path has to be shift, right, or left.

It turns out that we can derive Oracle fairly simply for many transition systems. The derivation for the transition system we’re using, Arc Hybrid, is in Goldberg and Nivre (2013).

We’re going to implement the oracle as a function that returns the zero-cost moves, rather than implementing a function Oracle(state). This prevents us from doing a bunch of costly copy operations. Hopefully the reasoning in the code isn’t too hard to follow, but you can also consult Goldberg and Nivre’s papers if you’re confused and want to get to the bottom of this.

def get_gold_moves(n0, n, stack, heads, gold):
    def deps_between(target, others, gold):
        for word in others:
            if gold[word] == target or gold[target] == word:
                return True
        return False

    valid = get_valid_moves(n0, n, len(stack))
    if not stack or (SHIFT in valid and gold[n0] == stack[-1]):
        return [SHIFT]
    if gold[stack[-1]] == n0:
        return [LEFT]
    costly = set([m for m in MOVES if m not in valid])
    # If the word behind s0 is its gold head, Left is incorrect
    if len(stack) &amp;amp;gt;= 2 and gold[stack[-1]] == stack[-2]:
    # If there are any dependencies between n0 and the stack,
    # pushing n0 will lose them.
    if SHIFT not in costly and deps_between(n0, stack, gold):
    # If there are any dependencies between s0 and the buffer, popping
    # s0 will lose them.
    if deps_between(stack[-1], range(n0+1, n-1), gold):
    return [m for m in MOVES if m not in costly]

Doing this “dynamic oracle” training procedure makes a big difference to accuracy — typically 1-2%, with no difference to the way the run-time works. The old “static oracle” greedy training procedure is fully obsolete; there’s no reason to do it that way any more.


I have the sense that language technologies, particularly those relating to grammar, are particularly mysterious. I can imagine having no idea what the program might even do.

I think it therefore seems natural to people that the best solutions would be over-whelmingly complicated. A 200,000 line Java package feels appropriate.

But, algorithmic code is usually short, when only a single algorithm is implemented. And when you only implement one algorithm, and you know exactly what you want to write before you write a line, you also don’t pay for any unnecessary abstractions, which can have a big performance impact.


[1] I wasn’t really sure how to count the lines of code in the Stanford parser. Its jar file ships over 200k, but there are a lot of different models in it. It’s not important, but over 50k seems safe.

[2] For instance, how would you parse, “John’s school of music calls”? You want to make sure the phrase “John’s school” has a consistent structure in both “John’s school calls” and “John’s school of music calls”. Reasoning about the different “slots” you can put a phrase into is a key way we reason about what syntactic analyses look like. You can think of each phrase as having a different shaped connector, which you need to plug into different slots — which each phrase also has a certain number of, each of a different shape. We’re trying to figure out what connectors are where, so we can figure out how the sentences are put together.

[3] There’s an updated version of the Stanford parser that gets better accuracy, using a “deep learning” technique. But, the accuracy of the final model is still way behind the best shift-reduce parsers. It’s a great paper, and it doesn’t really matter that the idea was implemented on top of a parser that isn’t state-of-the-art. It seems very likely that the idea would still work on top of a shift-reduce parser, and I look forward to someone doing that.

[4] A point of detail: the Stanford dependencies are actually produced automatically given gold-standard phrase-structure trees. See the Stanford Dependency Converter page here:

Idle speculation

For a long time, incremental language processing algorithms were primarily of scientific interest. If you want to write a parser to test a theory about how the human sentence processor might work, well, that parser needs to build partial interpretations. There’s a wealth of evidence, including commonsense introspection, that establishes that we don’t buffer input and analyse it once the speaker has finished.

But now algorithms with that neat scientific feature are winning! As best as I can tell, the secret to that success is to be:

  • Incremental. Earlier words constrain the search.
  • Error-driven. Training involves a working hypothesis, which is updated as it makes mistakes.

The links to human sentence processing seem tantalising. I look forward to seeing whether these engineering breakthroughs lead to any psycholinguistic advances.


The NLP literature is almost entirely open access. All of the relavant papers can be found here:

The parser I’ve described is an implementation of the dynamic-oracle Arc-Hybrid system here:

Goldberg, Yoav; Nivre, Joakim
Training Deterministic Parsers with Non-Deterministic Oracles
TACL 2013

However, I wrote my own features for it. The arc-hybrid system was originally described here:

Kuhlmann, Marco; Gomez-Rodriguez, Carlos; Satta, Giorgio
Dynamic programming algorithms for transition-based dependency parsers
ACL 2011

The dynamic oracle training method was first described here:

A Dynamic Oracle for Arc-Eager Dependency Parsing
Goldberg, Yoav; Nivre, Joakim

This work depended on a big break-through in accuracy for transition-based parsers, when beam-search was properly explored by Zhang and Clark. They have several papers, but the preferred citation is:

Zhang, Yue; Clark, Steven
Syntactic Processing Using the Generalized Perceptron and Beam Search
Computational Linguistics 2011 (1)

Another important paper was this little feature engineering paper, which further improved the accuracy:

Zhang, Yue;  Nivre, Joakim
Transition-based Dependency Parsing with Rich Non-local Features
ACL 2011

The generalised perceptron, which is the learning framework for these beam parsers, is from this paper:

Collins, Michael
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
EMNLP 2002

Experimental details

The results at the start of the post refer to Section 22 of the Wall Street Journal corpus. The Stanford parser was run as follows:

java -mx10000m -cp &amp;amp;quot;$scriptdir/*:&amp;amp;quot; edu.stanford.nlp.parser.lexparser.LexicalizedParser \
-outputFormat &amp;amp;quot;penn&amp;amp;quot; edu/stanford/nlp/models/lexparser/englishFactored.ser.gz $*

A small post-process was applied, to undo the fancy tokenisation Stanford adds for numbers, to make them match the PTB tokenisation:

&amp;amp;quot;&amp;amp;quot;&amp;amp;quot;Stanford parser retokenises numbers. Split them.&amp;amp;quot;&amp;amp;quot;&amp;amp;quot;
import sys
import re

qp_re = re.compile('\xc2\xa0')
for line in sys.stdin:
    line = line.rstrip()
        line = line.replace('(CD', '(QP (CD', 1) + ')'
        line = line.replace('\xc2\xa0', ') (CD ')
    print line

The resulting PTB-format files were then converted into dependencies using the Stanford converter:

for f in $1/*.mrg; do
  echo $f
  grep -v CODE $f &amp;amp;amp;gt; &amp;amp;quot;$f.2&amp;amp;quot;
  java -mx800m -cp &amp;amp;quot;$scriptdir/*:&amp;amp;quot; edu.stanford.nlp.trees.EnglishGrammaticalStructure \
   -treeFile &amp;amp;quot;$f.2&amp;amp;quot; -basic -makeCopulaHead -conllx &amp;amp;amp;gt; $out

I can’t easily read that anymore, but it should just convert every .mrg file in a folder to a CoNLL-format Stanford basic dependencies file, using the settings common in the dependency literature.

I then converted the gold-standard trees from WSJ 22, for the evaluation. Accuracy scores refer to unlabelled attachment score (i.e. the head index) of all non-punctuation tokens.

To train, I fed the gold-standard PTB trees for WSJ 02-21 into the same conversion script.

In a nutshell: The Stanford model and are trained on the same set of sentences, and they each make their predictions on a held-out test set, for which we know the answers. Accuracy refers to how many of the words’ heads we got correct.

Speeds were measured on a 2.4Ghz Xeon. I ran the experiments on a server, to give the Stanford parser more memory. The system runs fine on my MacBook Air. I used PyPy for the experiments; CPython was about half as fast on an early benchmark.

One of the reasons is so fast is that it does unlabelled parsing. Based on previous experiments, a labelled parser would likely be about 40x slower, and about 1% more accurate. Adapting the program to labelled parsing would be a good exercise for the reader, if you have access to the data.

The result from the Redshift parser was produced from commit b6b624c9900f3bf, which was run as follows:

./scripts/ -x zhang+stack -k 8 -p ~/data/stanford/train.conll ~/data/parsers/tmp
./scripts/ ~/data/parsers/tmp ~/data/stanford/devi.txt /tmp/parse/
./scripts/ /tmp/parse/parses ~/data/stanford/dev.conll

2013/12/18 - Posted by | Uncategorized


  1. Fantastic work and exposition, thanks for being so clear!

    Comment by Jesus Lopez | 2014/04/11 | Reply

  2. Having struggled with Stanford parser for a while, this is pure gold. Thank you so much.

    PS: you might want to escape those quotes in the second line of the code example 🙂

    Comment by Luca Soldaini | 2014/04/29 | Reply

  3. Great article– I won’t lie I did not finish it yet- but I’ve put it on my to-read list for tomorrow… I’ve been working with python-NLTK, maltparser, (and various other tools) – developing NLP tools and little webapps with the goal of assiting people with learning disabilities in language giving them an alternitive way to help then master language (I, myself am included in this group of individuals). [The tools I currently have online are all experiments not polished tools]

    I will definitly be spending time looking over your blog 🙂

    Keep up the good work!

    Comment by Jon Klopfer | 2014/04/29 | Reply

    • Ah great — you’ll find that using tools you really know the internals of, with your own glue, is actually easier, as well as much more flexible!

      For tokenisation, I recommend the Splitta Python library.

      Comment by honnibal | 2014/04/29 | Reply

  4. Regarding benchmarks: Super fast “greedy” mode: over 1,000 sentences per second at 91.5% accuracy and the slowest mode is 100 sentences per second. How many cores do you use to benchmark?

    Comment by Leonid Boytsov (@srchvrs) | 2014/04/29 | Reply

    • Single core, for all settings. All parsers process each sentence independently, so can parallelise on the sentences in the same way.

      Comment by honnibal | 2014/04/29 | Reply

  5. So, this is 400 sentences in the slow mode and 1000 sentences in the greedy mode on a reasonably new hardware. Interesting, thank you for the clarification!

    Comment by Leonid Boytsov (@srchvrs) | 2014/04/29 | Reply

  6. Hi,
    Great article. I was wondering if you knew any easy way to say convert the PTB corpus from nltk into a usable format for ? A toy data example might also be great.

    Comment by Artyom | 2014/04/30 | Reply

    • I guess you mean to train a model. If so, I have been trying with this tool ( Works fine to convert from PTB to CONLL, but when trying to train the model with the converted output, I get an error:

      Traceback (most recent call last):
      File “./scripts/”, line 54, in
      File “/Library/Python/2.7/site-packages/plac-0.9.1-py2.7.egg/”, line 309, in call
      cmd, result = parser_from(obj).consume(arglist)
      File “/Library/Python/2.7/site-packages/plac-0.9.1-py2.7.egg/”, line 195, in consume
      return cmd, self.func(*(args + varargs + extraopts), **kwargs)
      File “./scripts/”, line 49, in main
      parser.train(train_data, n_iter=n_iter)
      File “parser.pyx”, line 134, in redshift.parser.BaseParser.train (redshift/parser.cpp:4378)
      File “parser.pyx”, line 433, in redshift.parser.GreedyParser.static_train (redshift/parser.cpp:8647)
      File “transitions.pyx”, line 165, in redshift.transitions.TransitionSystem.transition (redshift/transitions.cpp:3186)
      StandardError: 77

      Comment by Rodrigo Alarcón (@ralarconm) | 2014/05/01 | Reply

      • At a glance, I would guess:

        I think the LTH converter produces non-projective trees for a minority of constructions. This violates an assumption of the transition system/training oracle. The implementation (and underlying algorithm) only works for projective trees.

        Comment by honnibal | 2014/05/01

      • @honnibal, thanks for the hint. Training works with trees formatted to CONLL with the StanfordParser, as you described in Experimental Details…

        Comment by Rodrigo Alarcón (@ralarconm) | 2014/05/02

    • Well, the LDC claims that OntoNotes 5 is free for non-members. I’m asking them whether I can distribute it. In the meantime, you can go through their sign-up rigmarole and obtain the corpus.

      Comment by honnibal | 2014/05/01 | Reply

  7. Pushing Ontonotes 5 through Jinho Choi’s dependency converter produces a small fraction of non-projective trees. I’ve got some preliminary results with malt and MaltOptimizer for a set where each partition is a superset of the corresponding WSJ partition. Would like
    to compare. As far as I know, there is no standard split for the non-WSJ parts of ontonotes. Does anyone know of one? Not an advocate of standard splits, but nonetheless reviewers want them.

    Comment by Chris Brew | 2014/06/05 | Reply

    • I don’t think the community’s settled on a standard split for the other OntoNotes sections. I see people do cross-fold validation on those sometimes. Personally I don’t like cross-fold, as I think it makes it easier to make mistakes.

      If you’re comparing the PTB WSJ and OntoNotes WSJ, you might want to use the Vadas and Curran NP-bracket annotations for the PTB before you feed the corpus through the converter. As far as I know OntoNotes has full NP-bracketing, which is one of the reasons parsers score lower on it.

      Comment by honnibal | 2014/06/05 | Reply

  8. Thanks for a great overview. Your explanation of the Oracle idea to improve greedy parsing was very clear. On beam-search you mention: “This work depended on a big break-through in accuracy for transition-based parsers, when beam-search was properly explored by Zhang and Clark.” Could you elaborate on the exact nature of the break-through? The only non-standard trick they use seems to be the early-update strategy of Collins and Roark (2004). Is there some other idea (other than great feature engineering) that makes your 93.6 possible?

    Comment by Deniz Yuret | 2014/07/04 | Reply

    • The Zhang and Nivre (2011) version of the Zhang and Clark beam search parser really could have been done in 2004.

      I think it’s quite interesting, really.

      The way I read the history — and I wasn’t publishing at the time, mind! — is that the MALT parser was interesting but low accuracy, and was seen as something that you did for languages other than English, or when you really wanted speed over accuracy.

      It wasn’t until 2008 that Zhang and Clark implemented the beam-search with the parser properly, with a global model and early update (know since Collins (2002)). The contribution in the Zhang and Clark (2011) CL paper is “hey this beam-search framework works really well on a variety of problems”, which I think is a very interesting finding. I think there are still lots of problems just waiting to have that model run on them.

      But, the accuracy of the model was still quite low, because they naturally just adopted Nivre’s features, figuring he and his grad students had optimised them well. And they had: on the greedy parser with the _static_ oracle!

      The catch is that the static oracle punishes you for conditioning too much on your parse state, because your parse state will be different at training and test time. So if you add higher-order features to the static model, you risk your accuracy _decreasing_.

      So, Zhang and Nivre (2011) publish this little short paper with extra features, and suddenly the accuracy of their model is very strong.

      The story actually continues a little, but this part hasn’t made it into the publication record yet. Goldberg and Nivre ran their experiments with the Zhang and Nivre feature set, which has been tuned for beam-search parsing. The beam setting is more forgiving for feature engineering, because even if you have constructions which are not locally decidable with your feature set, you can rely on the beam a little.

      I said that a bit confusingly, but the point is just this: the greedy model gets no second chances, so actually needs more features. I’ve chatted to Jin-ho Choi about this, and he’s found he can get the greedy model up to about 92% accuracy. I’ve found something similar. The same features help the beam parser with narrow beams, but are only slightly useful at wide beams.

      All up, I’d say the story is this: the narrower your search, the more you rely on feature engineering. But, feature engineering _can_ compensate for narrow search, and when you get the features right, you get better results than broad-search, narrow features.

      Come to think of it, the MALT parser used a polynomial kernel, didn’t it? A similar story is at work with linear models.

      Comment by honnibal | 2014/07/05 | Reply

  9. I noticed you used arc-hybrid in and arc-eager in redshift. Any reason to choose one vs the other?

    Comment by denizyuret | 2014/07/05 | Reply

    • Everything about the blog post implementation is optimized for ease of explanation, not accuracy or efficiency.

      Redshift has whatever I needed for my experiments, for whatever paper I was working on at the time. It made sense to replicate the Goldberg and Nivre (2012) result for my CoNLL paper with Yoav and Mark. I then replicated the Zhang and Nivre (2011) beam result when I was working on my TACL 2014 paper. I’ve been using arc-hybrid beam parsing lately, because I find it easy to think about the dynamic oracle for that system.

      I think the accuracy advantage we’re seeing for arc-eager is probably just feature tuning, and arc-hybrid can be just as good. And it’s much simpler to explain, because words never sit on the stack with heads already assigned to them, and there are no pre-conditions.

      On the other hand, maybe I’m wrong. I still haven’t gotten the arc-hybrid accuracy quite to the level of the arc-eager. Maybe it really is a good idea to make a label prediction early to use as a feature, and to create dependencies between the same pair of context tokens (S0 and N0), where the arc-hybrid creates between (S1, S0) and (S0, N0).

      Comment by honnibal | 2014/07/06 | Reply

  10. Fantastic Posting! (which I’m now trying to re-implement in Scala, ‘just because’)

    In your gist for this, it looks like (in lines 441-451) that you train the POS and Dependency parser at the same time, except : (a) the POS tagger is only trained for the first 5 of the 15 overall iterations; and (b) the Dependency parser is always using GoldTags for learning (even though it uses the POS tagger output for its parsing features).

    Is there any benefit to training the two simultaneously at the beginning (or is there some interplay between them)? Or might it make sense to train the POS tagger first, and then use its output tags in the training of the Dependency parser?

    One thing that’s had me confused is that the Perceptron.weights are doing double-duty : During training they relate to the most current value in the search for a good classifier (and are naturally ints), whereas once they’re averaged, they really stand for something other than the ‘bleeding edge’ (and will be floats, which then entirely define the state of the Perceptron model in the pickle).

    Thanks again for writing this up : It’s excellent to be able to see something boiled down to its essence by an expert.

    Comment by Martin Andrews | 2014/08/01 | Reply

    • You’re right that they’re trained simultaneously, but not that the parser is trained from gold_tags. The implementation is a bit misleading about this — I’m passing gold_tags to Parser.train_one, but it’s not used.

      Training the two models “online” like this is a bit non-standard. The usual approach is to cross-fold train the tagger, predicting on the folds excluded from training, so that you have a set of predicted tags for the parser training data. Note that you don’t want to train the tagger, and then predict on its own training data — those tags will be more accurate than the tags you’ll see at test time.

      Training the two together allows the parser to see the tagger making mistakes, as the tagger hasn’t achieved high accuracy yet. I’ve found it’s effective enough, and it’s much simpler, imo. I don’t like having intermediate data-files in my experiments. I try to avoid relying on file-system state.

      Finally, the nice thing about averaging the weights in-place is that you get to call the same Parser.predict function during training and run-time, and you don’t have to branch on a flag like Parser.is_trained within Parser.predict.

      Comment by honnibal | 2014/08/01 | Reply

  11. […] “There are a number of deep learning packages out there. However most sacrifice readability for efficiency. This has two disadvantages: (1) It is difficult for a beginner student to understand what the code is doing, which is a shame because sometimes the code can be a lot simpler than the underlying math. (2) Every other day new ideas come out for optimization, regularization, etc. If the package used already has the trick implemented, great. But if not, it is difficult for a researcher to test the new idea using impenetrable code with a steep learning curve. So I started writing KUnet.jl which currently implements backprop with basic units like relu, standard loss functions like softmax, dropout for generalization, L1-L2 regularization, and optimization using SGD, momentum, ADAGRAD, Nesterov’s accelerated gradient etc. in less than 500 lines of Julia code. Its speed is competitive with the fastest GPU packages (here is a benchmark). For installation and usage information, please refer to the GitHub repo. The remainder of this post will present (a slightly cleaned up version of) the code as a beginner’s neural network tutorial (modeled after Honnibal’s excellent parsing example)…” […]

    Pingback by Beginning deep learning with 500 lines of Julia | thoughts... | 2015/03/03 | Reply

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: