Computational Linguistics

Demystifying NLP

A good POS tagger in about 200 lines of Python

Update: For a fast and accurate text-processing from Python, see

Up-to-date knowledge about natural language processing is mostly locked away in academia. And academics are mostly pretty self-conscious when we write. We’re careful. We don’t want to stick our necks out too much. But under-confident recommendations suck, so here’s how to write a good part-of-speech tagger.

There are a tonne of “best known techniques” for POS tagging, and you should ignore the others and just use Averaged Perceptron.

You should use two tags of history, and features derived from the Brown word clusters distributed here.

If you only need the tagger to work on carefully edited text, you should use case-sensitive features, but if you want a more robust tagger you should avoid them because they’ll make you over-fit to the conventions of your training domain. Instead, features that ask “how frequently is this word title-cased, in a large sample from the web?” work well. Then you can lower-case your comparatively tiny training corpus.

For efficiency, you should figure out which frequent words in your training data have unambiguous tags, so you don’t have to do anything but output their tags when they come up. About 50% of the words can be tagged that way.

And unless you really, really can’t do without an extra 0.1% of accuracy, you probably shouldn’t bother with any kind of search strategy — you should just use a greedy model.

If you do all that, you’ll find your tagger easy to write and understand, and an efficient Cython implementation will perform as follows on the standard evaluation, 130,000 words of text from the Wall Street Journal:

Tagger Accuracy Time (130k words)
CyGreedyAP 97.1% 4s

The 4s includes initialisation time — the actual per-token speed is high enough to be irrelevant; it won’t be your bottleneck.

It’s tempting to look at 97% accuracy and say something similar, but that’s not true. My parser is about 1% more accurate if the input has hand-labelled POS tags, and the taggers all perform much worse on out-of-domain data. Unfortunately accuracies have been fairly flat for the last ten years. That’s why my recommendation is to just use a simple and fast tagger that’s roughly as good.

The thing is though, it’s very common to see people using taggers that aren’t anywhere near that good!  For an example of what a non-expert is likely to use, these were the two taggers wrapped by TextBlob, a new Python api that I think is quite neat:

Tagger Accuracy Time (130k words)
NLTK 94.0% 3m56s
Pattern 93.5% 26s

Both Pattern and NLTK are very robust and beautifully well documented, so the appeal of using them is obvious. But Pattern’s algorithms are pretty crappy, and NLTK carries tremendous baggage around in its implementation because of its massive framework, and double-duty as a teaching tool.

As a stand-alone tagger, my Cython implementation is needlessly complicated — it was written for my parser. So today I wrote a 200 line version of my recommended algorithm for TextBlob. It gets:

Tagger Accuracy Time (130k words)
PyGreedyAP 96.8% 12s

I traded some accuracy and a lot of efficiency to keep the implementation simple. Here’s a far-too-brief description of how it works.

Averaged perceptron

POS tagging is a “supervised learning problem”. You’re given a table of data, and you’re told that the values in the last column will be missing during run-time. You have to find correlations from the other columns to predict that value.

So for us, the missing column will be “part of speech at word i“. The predictor columns (features) will be things like “part of speech at word i-1“, “last three letters of word at i+1“, etc

First, here’s what prediction looks like at run-time:

  def predict(self, features):
      '''Dot-product the features and current weights and return the best class.'''
      scores = defaultdict(float)
      for feat in features:
          if feat not in self.weights:
          weights = self.weights[feat]
          for clas, weight in weights.items():
              scores[clas] += weight
      # Do a secondary alphabetic sort, for stability
      return max(self.classes, key=lambda clas: (scores[clas], clas))

Earlier I described the learning problem as a table, with one of the columns marked as missing-at-runtime. For NLP, our tables are always exceedingly sparse. You have columns like “word i-1=Parliament”, which is almost always 0. So our “weight vectors” can pretty much never be implemented as vectors. Map-types are good though — here we use dictionaries.

The input data, features, is a set with a member for every non-zero “column” in our “table” — every active feature. Usually this is actually a dictionary, to let you set values for the features. But here all my features are binary present-or-absent type deals.

The weights data-structure is a dictionary of dictionaries, that ultimately associates feature/class pairs with some weight. You want to structure it this way instead of the reverse because of the way word frequencies are distributed: most words are rare, frequent words are very frequent.

Learning the weights

Okay, so how do we get the values for the weights? We start with an empty weights dictionary, and iteratively do the following:

  1. Receive a new (features, POS-tag) pair
  2. Guess the value of the POS tag given the current “weights” for the features
  3. If guess is wrong, add +1 to the weights associated with the correct class for these features, and -1 to the weights for the predicted class.

It’s one of the simplest learning algorithms. Whenever you make a mistake, increment the weights for the correct class, and penalise the weights that led to your false prediction. In code:

  def train(self, nr_iter, examples):
      for i in range(nr_iter):
          for features, true_tag in examples:
              guess = self.predict(features)
              if guess != true_tag:
                  for f in features:
                      self.weights[f][true_tag] += 1
                      self.weights[f][guess] -= 1

If you iterate over the same example this way, the weights for the correct class would have to come out ahead, and you’d get the example right. If you think about what happens with two examples, you should be able to see that it will get them both right unless the features are identical. In general the algorithm will converge so long as the examples are linearly separable, although that doesn’t matter for our purpose.

Averaging the weights

We need to do one more thing to make the perceptron algorithm competitive. The problem with the algorithm so far is that if you train it twice on slightly different sets of examples, you end up with really different models. It doesn’t generalise that smartly. And the problem is really in the later iterations — if you let it run to convergence, it’ll pay lots of attention to the few examples it’s getting wrong, and mutate its whole model around them.

So, what we’re going to do is make the weights more `sticky’ — give the model less chance to ruin all its hard work in the later rounds. And we’re going to do that by returning the averaged weights, not the final weights.

I doubt there are many people who are convinced that’s the most obvious solution to the problem, but whatever. We’re not here to innovate, and this way is time tested on lots of problems. If you have another idea, run the experiments and tell us what you find. Actually I’d love to see more work on this, now that the averaged perceptron has become such a prominent learning algorithm in NLP.

Okay. So this averaging. How’s that going to work? Note that we don’t want to just average after each outer-loop iteration. We want the average of all the values — from the inner loop. So if we have 5,000 examples, and we train for 10 iterations, we’ll average across 50,000 values for each weight.

Obviously we’re not going to store all those intermediate values. Instead, we’ll track an accumulator for each weight, and divide it by the number of iterations at the end. Again: we want the average weight assigned to a feature/class pair during learning, so the key component we need is the total weight it was assigned. But we also want to be careful about how we compute that accumulator, too. On almost any instance, we’re going to see a tiny fraction of active feature/class pairs. All the other feature/class weights won’t change. So we shouldn’t have to go back and add the unchanged value to our accumulators anyway, like chumps.

Since we’re not chumps, we’ll make the obvious improvement. We’ll maintain another dictionary that tracks how long each weight has gone unchanged. Now when we do change a weight, we can do a fast-forwarded update to the accumulator, for all those iterations where it lay unchanged.

Here’s what a weight update looks like now that we have to maintain the totals and the time-stamps:

  def update(self, truth, guess, features):
      def upd_feat(c, f, v):
          nr_iters_at_this_weight = self.i - self._timestamps[f][c]
          self._totals[f][c] += nr_iters_at_this_weight * self.weights[f][c]
          self.weights[f][c] += v
          self._timestamps[f][c] = self.i

      self.i += 1
      for f in features:
          upd_feat(truth, f, 1.0)
          upd_feat(guess, f, -1.0)

Features and pre-processing

The POS tagging literature has tonnes of intricate features sensitive to case, punctuation, etc. They help on the standard test-set, which is from Wall Street Journal articles from the 1980s, but I don’t see how they’ll help us learn models that are useful on other text.

To help us learn a more general model, we’ll pre-process the data prior to feature extraction, as follows:

  • All words are lower cased;
  • Digits in the range 1800-2100 are represented as !YEAR;
  • Other digit strings are represented as !DIGITS

It would be better to have a module recognising dates, phone numbers, emails, hash-tags, etc. but that will have to be pushed back into the tokenization.

I played around with the features a little, and this seems to be a reasonable bang-for-buck configuration in terms of getting the development-data accuracy to 97% (where it typically converges anyway), and having a smaller memory foot-print:

    def _get_features(self, i, word, context, prev, prev2):
        '''Map tokens-in-contexts into a feature representation, implemented as a
        set. If the features change, a new model must be trained.'''
        def add(name, *args):
            features.add('+'.join((name,) + tuple(args)))

        features = set()
        add('bias') # This acts sort of like a prior
        add('i suffix', word[-3:])
        add('i pref1', word[0])
        add('i-1 tag', prev)
        add('i-2 tag', prev2)
        add('i tag+i-2 tag', prev, prev2)
        add('i word', context[i])
        add('i-1 tag+i word', prev, context[i])
        add('i-1 word', context[i-1])
        add('i-1 suffix', context[i-1][-3:])
        add('i-2 word', context[i-2])
        add('i+1 word', context[i+1])
        add('i+1 suffix', context[i+1][-3:])
        add('i+2 word', context[i+2])
        return features

I haven’t added any features from external data, such as case frequency statistics from the Google Web 1T corpus. I might add those later, but for now I figured I’d keep things simple.

What about search?

The model I’ve recommended commits to its predictions on each word, and moves on to the next one. Those predictions are then used as features for the next word. There’s a potential problem here, but it turns out it doesn’t matter much. It’s easy to fix with beam-search, but I say it’s not really worth bothering. And it definitely doesn’t matter enough to adopt a slow and complicated algorithm like Conditional Random Fields.

Here’s the problem. The best indicator for the tag at position, say, 3 in a sentence is the word at position 3. But the next-best indicators are the tags at positions 2 and 4. So there’s a chicken-and-egg problem: we want the predictions for the surrounding words in hand before we commit to a prediction for the current word. Here’s an example where search might matter:

Their management plan reforms worked

Depending on just what you’ve learned from your training data, you can imagine making a different decision if you started at the left and moved right, conditioning on your previous decisions, than if you’d started at the right and moved left.

If that’s not obvious to you, think about it this way: “worked” is almost surely a verb, so if you tag “reforms” with that in hand, you’ll have a different idea of its tag than if you’d just come from “plan“, which you might have regarded as either a noun or a verb.

Search can only help you when you make a mistake. It can prevent that error from throwing off your subsequent decisions, or sometimes your future choices will correct the mistake. And that’s why for POS tagging, search hardly matters! Your model is so good straight-up that your past predictions are almost always true. So you really need the planets to align for search to matter at all.

And as we improve our taggers, search will matter less and less. Instead of search, what we should be caring about is multi-tagging. If we let the model be a bit uncertain, we can get over 99% accuracy assigning an average of 1.05 tags per word (Vadas et al, ACL 2006). The averaged perceptron is rubbish at multi-tagging though. That’s its big weakness. You really want a probability distribution for that.

One caveat when doing greedy search, though. It’s very important that your training data model the fact that the history will be imperfect at run-time. Otherwise, it will be way over-reliant on the tag-history features. Because the Perceptron is iterative, this is very easy.

Here’s the training loop for the tagger:

    def train(self, sentences, save_loc=None, nr_iter=5, quiet=False):
        '''Train a model from sentences, and save it at save_loc. nr_iter
        controls the number of Perceptron training iterations.'''
        self._make_tagdict(sentences, quiet=quiet)
        self.model.classes = self.classes
        prev, prev2 = START
        for iter_ in range(nr_iter):
            c = 0; n = 0
            for words, tags in sentences:
                context = START + [self._normalize(w) for w in words] + END
                for i, word in enumerate(words):
                    guess = self.tagdict.get(word)
                    if not guess:
                        feats = self._get_features(i, word, context, prev, prev2)
                        guess = self.model.predict(feats)
                        self.model.update(tags[i], guess, feats)
                    # Set the history features from the guesses, not the true tags
                    prev2 = prev; prev = guess
                    c += guess == tags[i]; n += 1
            if not quiet:
                print("Iter %d: %d/%d=%.3f" % (iter_, c, n, _pc(c, n)))
        # Pickle as a binary file
        if save_loc is not None:
            cPickle.dump((self.model.weights, self.tagdict, self.classes),
                         open(save_loc, 'wb'), -1)

Unlike the previous snippets, this one’s literal — I tended to edit the previous ones to simplify. So if they have bugs, hopefully that’s why!

At the time of writing, I’m just finishing up the implementation before I submit a pull request to TextBlob. You can see the rest of the source here:

A final comparison…

Over the years I’ve seen a lot of cynicism about the WSJ evaluation methodology. The claim is that we’ve just been meticulously over-fitting our methods to this data. Actually the evidence doesn’t really bear this out. Mostly, if a technique is clearly better on one evaluation, it improves others as well. Still, it’s very reasonable to want to know how these tools perform on other text. So I ran the unchanged models over two other sections from the OntoNotes corpus:

Tagger WSJ ABC Web
Pattern 93.5 90.7 88.1
NLTK 94.0 91.5 88.4
PyGreedyAP 96.8 94.8 91.8

The ABC section is broadcast news, Web is text from the web (blogs etc — I haven’t looked at the data much).

As you can see, the order of the systems is stable across the three comparisons, and the advantage of our Averaged Perceptron tagger over the other two is real enough. Actually the pattern tagger does very poorly on out-of-domain text. It mostly just looks up the words, so it’s very domain dependent. I hadn’t realised it before, but it’s obvious enough now that I think about it.

We can improve our score greatly by training on some of the foreign data. The technique described in this paper (Daume III, 2007) is the first thing I try when I have to do that.

P.S. The format of these tables sucks. Probably I should move off to actual hosting, but in the meantime, any suggestions?


2013/09/11 - Posted by | Uncategorized


  1. […] Honnibal wrote a clear and detailed blog post about the Averaged Perception and his implementation here. For this reason, this post will focus on how to get and use the tagger without providing […]

    Pingback by Tutorial: State-of-the-art Part-of-Speech Tagging in TextBlob | spider's space | 2013/09/17 | Reply

  2. One of the most readable, clear, and just best posts I’ve ever seen on CL. I wish the Internet was always this good.

    Comment by halgrimnlp | 2013/09/19 | Reply

  3. First of all congratulations on the nice Python implementation for Average Perceptron and the clearly-written documentation!

    It is true that academics (computer linguists) are careful. Reasons I see is that accuracy scores are domain-dependent, and that training data is often licensed.

    Since the Brill tagger is a “first-generation” tagger, it is also true that more robust methods have been around for some time. If nothing else, the Brill Tagger is (generally accepted to be) public domain, making it compatible with Pattern’s BSD license. Since TronTagger (or PyGreedyAP) is now part of TextBlob (MIT, free for commercial use), did you check the license on your training data?

    Below are some tests I did, with accuracy scores and performance. I used WSJ section 20 and OANC (assorted sentences from written_1/journal/, 180,000 words). The high score for Brill on WSJ section 20 might be due to that it was trained on the same data, I don’t know the details.

    1) {word: tag} dictionary: 0.926 WSJ (1s), 0.888 OANC (1.5s)
    2) Brill tagger (Pattern): 0.965 WSJ (32s), 0.925 OANC (124s)
    3) Brill tagger (TextBlob): 0.965 WSJ (12s), 0.925 OANC (48s)
    4) TronTagger: 0.955 WSJ (6s), 0.912 OANC (27s)
    5) MBSP: 0.963 WSJ (82s), 0.944 (920s)
    6) dictionary + SVM: 0.955 WSJ (1.5s), 0.930 (3s)

    The SVM (6) was trained on OANC following your steps. Do you mind sharing your test data? It might be interesting to continue along this path to improve the Pattern taggers.


    Comment by tomdesmedt | 2013/09/19 | Reply

    • Hi Tom,

      Thanks for running these tests!

      The awkward thing about train/dev/test splits is that POS taggers usually use 00-18/19-21/22-24, and parsers use 02-21/24/23, for historical reasons. So if a tagger was trained as a pre-processor to a parser being tested in the standard way, the tagger will be trained on 20. So I think it’s probably safer to test on 22, 23 or 24.

      About that OANC corpus, I’ve just had a quick google and the pages are talking about the tags being automatically added. Is there a manually annotated subset?

      Finally, for the SVM model in 6, how did the tag history features work? Did you do some sort of boot-strapping, or did the problem turn out not to matter?

      Comment by honnibal | 2013/09/19 | Reply

  4. OK you nerds, very impressive. I’m not a linguist (I thought CL stood for Craigslist) nor am I a programmer (I thought textblob really was an upcoming movie). My (should have been) fleeting curiosity to understand the “state and nature” of modern spell-check/correction/suggestion techniques has led my to your doorstep.

    I have smoke coming out of my ears from my brain overheating as I’ve reached the end of my breadcrumbs here… – in this blog and comment section. With posts and comments; and replys still dripping wet ink – from you kingpins of CL. So I say; “F** it.” I’ll just bite the bullet; admit I’m inferior in the face of greatness; throw in the towel on my self-guided research; and just humbly ask you guys to “dumb it down” for me; so I can quit chasing my tail.

    I’ll try to explain (using English); and after some well deserved ridicule, could you; would you; Please…enlighten me on the following paradigm.

    I’m interested in live spell-check/grammar/correction/suggestion methods used by modern smart devices, mostly for communication – but not limited to it. T9 seemed to have it nearly perfected 10 years ago; but has since vanished from current implementations, thanks to the phenomenal explosion of touchscreen devices. Touchscreen QWERTY (thanks Apple) obviously didn’t have any of the strict limitations imposed by the T9 numeric keypad translator . It’s a no-brainer; good riddance T9.

    The issue I have is: It seems like we deal with more spelling errors now; autosuggestion appear to be less accurate (more breadth), ultimately requiring more effort than was typical with the T9 predecessor I remember.

    So the conundrum is – The fully QWERTY touchscreen introduces the opportunity for more mistakes, which require more effort than the technically inferior T9 (quasi-wannabe-keyboard-Numeric-Keypad). Now granted; the QWERTY is superior (maybe perfect) if you type and spell flawlessly; but the touchscreen implementation almost guarantees less than perfect typing – which increases miss-spellings regardless of what a user “intended” to spell. This explains the greater breadth of the autosuggestion set, a result of “miss-keying” compounded by miss-spelling. Compared to T9, where miss-spellings typically were confined to a more limited set of alternatives, obviously this increased the likelihood of accurately suggesting the correct word.

    As far as “on-the-fly” autosuggestion/spell-checking/completion (is there an acronym for this bs?) is concerned, It would appear the textblob and NLP systems would fall short on accuracy because they rely on training data has been through a few F7 cycles before going to print at the WSJ. As far as the saying goes “Garbage in, garbage out” the training data cited should be nearly ideal regarding spelling and grammar. Fair enough. But what happens to the systems when forced to interpret “on the fly”; with the heavy reality of modern day messaging, complete with all of its miss-keyed miss-spellings; ebonicized ghetto grammar, acronyms, and the constantly evolving urban slang that surely drives 8th grade English Teachers to point of at least considering the purchase of an AR-15 (for education purposes of course).

    To their credit – the kids seem to have no issue communicating amongst themselves with a drastically expanded vernacular that propagates virally, and can morph overnight. Shwag just ain’t what it used to be, and you don’t even want to have to figure out what it means when a 10 year old uses ‘merf’ in a response to most anything. Who would have ever believed English

    I found much less information regarding spellcheck(); but interpret it to be a fairly robotic dictionary look-up, and seemingly less important in the larger context of challenges imposed by NLP algorithms. That being said; I’m very curious to know your thoughts on how the recent discovery of the phenomenon Typoglycemia is utilized by spell-check/completion algorithms; if it all?

    What is Typoglycemia. This – “I cdn’uolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg: the phaonmneel pweor of the hmuan mnid. Aoccdrnig to a rseearch taem at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Scuh a cdonition is arppoiatrely cllaed Typoglycemia .
    “Amzanig huh? Yaeh and you awlyas thguoht slpeling was ipmorantt.”

    It seems there is room for improvement. I have ideas to innovate in this dimension; but since I’m a non-programming NLP noob; I figure I’ll just ask you experts, rather than risking the effort of trying to become one; only to learn that my ideas are juvenile at best; stoopid at worst; and laughable at both.

    Can you offer some insight; tell me what I’ve missed; or just steer me in the right direction to find the “ultimate” real-time correction-suggestion engine? I’ve searched patents; programs; everything on wiki; and tried most aftermarket keyboard android apps. No doubt they QWERTY was an abominable, reprehensible; god-forsaken cruel joke and ANY remapping of characters could not degrade it’s usability. However, no matter how logical a key mapping is, the greatest strides in WPM will come from software that “gets” what we’re “saying” without testing our ability to spell, hunt, peck or correct.

    A little known factoid: Do you know what the single most common key on EVERY keyboard is??
    Backspace. (stretched right pinky). Now real quick; without playing through the movements tell me this: What 3 characters does your index finger type. Why?

    Ok, All done here, as you were.

    Comment by John Jacob Jingle-Heimer-Smith (Again) | 2013/09/20 | Reply

    • Hi,

      Mobile text entry’s an active research area. I think it’s interesting, but I haven’t worked in it. If I were trying to improve a mobile text entry system, I doubt I’d look into part-of-speech tagging. I don’t think it would be very helpful.

      I think it’s primarily an HCI (Human Computer Interface) issue, more than a language modelling issue. Sure, a better language model will give you better spelling suggestions, all else being equal. But that’s a tiny factor compared to having better design.

      I came across this guy’s bibliography and found his ideas quite interesting: . I can’t say I’ve read any of it carefully, though.

      Comment by honnibal | 2013/09/20 | Reply

  5. Very nice post & code!
    I just implemented a perceptron learner myself, and it’s interesting to compare to your implementation.
    Most things are done quite similarly — however, I don’t understand your remark
    “The weights data-structure is a dictionary of dictionaries, that ultimately associates feature/class pairs with some weight. You want to structure it this way instead of the reverse […]”
    My data structure is “class -> (feature -> weight)”, and I don’t see (but would like to see) why the other way is more inefficient…

    Comment by Ben | 2013/10/12 | Reply

    • Heya,

      Good question! That was a bad explanation for that issue — I tried to keep the discussion moving, and that’s really not the right way to think about that problem.

      Consider the prediction code, where the weights are calculated:

      What would this loop look like if we’d structured the weights the other way around? Well, what we’d be doing is looping through every class for every feature.

      Most features will have been seen fewer than 100 times in the training data. So it’s very unlikely they’ll have a non-zero weight for every class. Probably only 1-5 of the classes will have a non-zero weight. And if we’ve never seen the feature at all, we can skip the class-iteration entirely.

      So it’s not really about Zipf’s law, as such. I should rewrite that bit.

      Comment by honnibal | 2013/10/12 | Reply

      • thanks for the additional explanation!

        Comment by Ben | 2013/10/12

  6. Though I don’t like Python, sometime I get amazed with its brevity. If you are going to implement same thing in Java, it will probably take much longer than to do that.

    Comment by Ajinkya | 2014/01/16 | Reply

  7. Excellent post, it inspired me to implement my own ( Two quick thoughts:

    In the averaged perceptron, with weights structured the way you have done them here, is there ultimately any reason to actually divide? Isn’t the normalizing divisor a constant, and if so, can’t we just use the summed weights at inference time?

    I ended up structuring my weights as a dictionary of class (tag) keys mapping to feature-weight key-value pairs (as @Ben is saying). I spent a lot of time arguing with myself about which made more sense but ultimately settled on this. If there’s any optimization I missed because of that, I don’t see it, but I’d be interested to know.

    Comment by Kyle Gorman | 2014/03/15 | Reply

    • *blinks*

      How the fuck did I miss that all these years? Going to have to run the experiment just to check I’m not dreaming, but you’re surely right.

      Empirically, structuring the weights as feature-to-class is much better for this problem.

      We have 45 classes and lets say 15 active features per token. If we iterate per class, we get back a dict and must make 15 lookups into it, most missing. If we iterate per feature, we can get back a sparse array — we know we need every value returned to us, we don’t have to look them up.

      A nice, easy optimisation would be to update the features data structure to map to a numpy sparse array on the inside instead of a dict.

      Comment by honnibal | 2014/03/15 | Reply

      • Please do correct me if I’m mistaken (as I surely am!), but if we use the sparse fast-forward updates for the accumulator fields, won’t we need to iterate over the features at the end of training anyway to “finalize” the accumulator values for features that weren’t present in the final example?

        While I agree that we could in principle use the raw accumulator values without dividing by the number of examples seen, it might be worthwhile to stick with convention if we’re already paying the cost of touching each feature at the end of training.

        Of course, if you *don’t* do the final accumulator updates, my point is moot…. and perhaps the weird pseudo-regularization one gets from ignoring accumulations on the last update on each feature pleases the spirit of your data, and doesn’t even hurt performance too much 😛

        Comment by Ryan Musa | 2014/07/21

      • Yeah, you’ll still need the accumulator updates.

        The upside of not averaging is that the weights can be integer-valued, rather than floats. That could be quite nice!

        On the other hand, it becomes more annoying to scale the weights against something else. The perceptron actually approximates a log probability estimate (in the sense that you can start from the idea that that’s what you’re trying to do, make some simplifying assumptions, and end up with the perceptron). It’s harder to interpret the “weight-accumulated” perceptron (as opposed to averaged perceptron) in that way.

        Comment by honnibal | 2014/07/21

      • Thanks for the quick reply! I agree that keeping everything integer-valued could be really helpful, especially if you’re only looking for the one-best solution during decoding.

        Comment by Ryan Musa | 2014/07/22

  8. This awesome post gave me idea for my master thesis topic. The goal is to tag (and maybe even parse) a non-standard english (learner) corpus. I just started digging into python, and only have previous experience with web-development, hence have very humble expectations of my own abilities to write a tagger/parser. But because of the complications naturally connected with “non-standard” part, I am not sure i can reliably use any existing taggers (and, gosh, nltk is massive!) and would probably need to invent my own tags to identify mistakes etc. I still haven’t quite figured out the learning algorithms though.

    Comment by Elvina | 2014/03/20 | Reply

    • Thanks for the kind words.

      There’s a growing literature on processing learner English. My colleague at Macquarie Mark Dras has done some of the early work on this, with his student Jojo Wong. Have a look for their papers, and let me know if you have any questions.


      Comment by honnibal | 2014/03/21 | Reply

      • Sorry for the noob question, but this type of tagger can be trained on non-stadrd english, which will result in similar value predictions?


        Comment by Elvina | 2014/06/10

  9. Hey honnibal, great stuff, I’m working on a comparison of different POS tagging methods. Could you tell me how you got the WSJ labelled data? Thanks.

    Comment by Bryan | 2014/05/16 | Reply

    • Via the Linguistic Data Consortium — the Penn Treebank 2 corpus.

      These days it’s better to obtain OntoNotes 5 and the Google Web Treebank though; they’re substantially cheaper.

      Comment by honnibal | 2014/05/16 | Reply

  10. > When I train on WSJ 0-21 and evaluate on 22-24, I get 98.01 accuracy on the training set and only 95.52 on the test set.

    Train on WSJ 00-19 and use 22-24 to benchmark, and then run only one or two tests on 19-21. The methodological discipline is important…Otherwise eventually you’ll fool yourself on your benchmark evaluation, and not have a test set to catch the error.

    > Now, I tried to re-implement you algorithm for my Lisp NLP toolkit, and I can’t reproduce your results (well, you know, as usual 🙂
    > What may I be doing wrong?

    Well…How should I know? 🙂

    What I would do in your situation is turn off all the features except one or two, and see whether you get the same accuracies. If you don’t, look at the feature values being spat out by both models, and verify they’re the same. If they are, then look at the weights being learned, and the predictions being made. If the two classifiers do perform the same on the minimal feature set, progressively re-add the features, until the scores diverge.

    > Also, you forgot to mention that a big part of speedup is due to single tag dictionary lookup. I get 3.5 sec performance with it and 9.2 sec without it.

    I think I said somewhere near the start that the tag dictionary serves about half of the tokens. Did you add the same thresholds as I did? You don’t want to use the tag dict for rare words.

    For tagging problems with many more classes, it’s good to use the tag dictionary even for unambiguous words — you get the set of possible classes, so that you only have to evaluate those ones. (The trade-off is that [(k, dict[k]) for k in dict.keys()] is much slower than dict.items() — so if the set of valid classes we iterate through is on average close in size to the full dictionary, we should just use dict.items(). Which is what happens on this problem.)

    Comment by Vsevolod (@vseloved) | 2014/07/01 | Reply

    • You’ve replied inline to my comment, which may be misleading to other users, probably, since I seem to be talking to myself 🙂 Anyway…

      > Train on WSJ 00-19 and use 22-24

      I’ve used 00-22 because you mentioned it in a previous comment. OK, training on 00-18 gives the following result:

      test – 95.02
      dev – 94.8
      train – 98.03

      Do you use 5 training runs?

      Next I’ll try the long route of feature-to-feature comparison. But can you also post or point to your evaluation code and the functions you use to extract the gold dataset? I’m particularily interested, wether you use original or predicted tags in the feature vectors for the golden corpus.

      > Run only one or two tests on 19-21. The methodological discipline is important…

      I don’t think there’s any methodological problem here – I’m just repeating your experiments.

      > Tagdicts: Did you add the same thresholds as I did?

      Yes, I’m reproducing your approach, as you remeber. I’ve also used your normalization routine, although I’ve seen it run somewhat better (.1 or .2) using somewhat more advanced normalization for numbers.

      > For tagging problems with many more classes, it’s good to use the tag dictionary even for unambiguous words.

      Indeed. From my experience I can only agree with you. It saves a lot of embarassing errors in production 🙂

      > The trade-off is that [(k, dict[k]) for k in dict.keys()] is much slower than dict.items()

      That’s, probably, because you’re making new tuples here. Try either iterating the dict or a dict comprehension (it should be optimized for such use case, although I’m not sure)

      Comment by Vsevolod (@vseloved) | 2014/07/02 | Reply

      • Oh I didn’t realise how that would show up on WP. Crazy.

        Anyway. I’m confused about your processing question, and I think this might be the problem with your accuracy.

        The training function takes the gold tags as input — how could it not? That’s what we’re supervising the parser to produce. But then as we train, we must use the tags we’ve previously predicted as features.

        Try just turning off the POS tag features, training my tagger, and then training your tagger?

        About the dict iteration, I’m pretty sure the difference is that looking up into the dictionary a bunch of times is slower than just iterating through its contents, so long as you don’t care how the contents are ordered when you iterate. I don’t think it’s about the tuple creation.

        Comment by honnibal | 2014/07/02

      • Yeah, WP, is really a mess. Now I can’t comment on your reply, just on mine.

        So, speaking about the POS tag features: I’ve implemented the training phase the way you describe, but I’m interested in the evaluation phase. During evaluation you give your classification function a vector of features, extracted from gold data. The question is, when you’re extracting the features, you can do it in 2 manners:

        – the same as in training, running classify for each token, remembering the predicted POS tags, and using them as features for subsequent tokens
        – or without the involvement of the classifier: just taking the POS tags available in the gold data

        Which way did you choose?

        Comment by Vsevolod (@vseloved) | 2014/07/02

      • Just a reminder about the number of training iterations question.

        I’ve also evaluated on ABC and Web from OntoNotes 4, and got 94.03 and 90.12 accuracy. Consistent lower numbers as with the WSJ corpus.

        Comment by Vsevolod (@vseloved) | 2014/07/02

  11. It uses the predicted POS tags as features at the prediction phase.

    If you run my code, do you get back the same numbers that I do?

    Comment by honnibal | 2014/07/03 | Reply

    • God, I can’t reply to your comment and can’t get it back. What a mess.

      This bit of code makes clear that the tag function consumes plain text. Note that there’s actually a bug here: we should set prev1 and prev2 to START per sentence. The bug slightly depresses accuracy…but only slightly.

      Comment by honnibal | 2014/07/03 | Reply

      • Yes, I’ve seen the bug and didn’t reproduce it in my code. Sorry, forgot to mention it.
        And your remark about plain text makes it clear: so, you don’t extract features directly for the evaluation step, but supply original sentences. Makes sense. Well, then we emply, basically, the same approach to evaluation. Now, I need to dig deeper and compare the taggers with different feature sets as you suggest…

        Comment by Vsevolod (@vseloved) | 2014/07/03

      • OK, I have run the evaluation of your model in textblob ( and got 94.32%. Here’s a link to the pretty-printed output of your tagger:

        One thing I’ve noticed is that due to NLTK tokenization for some sentences I get a different number of tokens from the tagger, than originally in the treebank. Here’s one example:

        The_DT complex_JJ financing_NN plan_NN in_IN the_DT S&L_NN bailout_NN law_NN includes_VBZ raising_VBG $_$ 30_CD billion_CD from_IN debt_NN issued_VBN by_IN the_DT newly_RB created_VBN RTC_NNP ._.

        tagged as:
        The_DT complex_JJ financing_NN plan_NN in_IN the_DT S_NNP &_CC L_NNP bailout_NN law_NN includes_VBZ raising_VBG $_$ 30_CD billion_CD from_IN debt_NN issued_VBN by_IN the_DT newly_RB created_VBN RTC_NNP ._.

        (S&L is split into 3 tokens)

        I used diff to caluclate the overlap, so, I believe, 94.32% is the best number. With naive matching the result is 92.80%, because when the sequences of tags diverge you, obviously, get a mess.

        Can you check it, please?

        Comment by Vsevolod (@vseloved) | 2014/07/03

      • What text are you even feeding in? I evaluate the tagger on pre-tokenized text because I’m not aware of any easy way to get back the original text from the PTB files.

        If your tokenization ever mismatches your training data, your accuracy will tank. This sucks, and is one of the big “gotchas” in NLP. I’m working on a better tokenizer; the way we do this at the moment is stupid.

        Comment by honnibal | 2014/07/04

      • Well, I extract text from treebanked data (i.e. from the parse trees – they have an extension .parse in OntonNotes WSJ).
        But how do you evaluate the tagger, if you don’t have tokenized sentences? Where do you take the tags from?

        Comment by Vsevolod (@vseloved) | 2014/07/04

      • The treebanked data is already tokenized — you should just split it on whitespace, or consume a list of tokens. It doesn’t make sense to try to apply something like the nltk tokenizer to it.

        Comment by honnibal | 2014/07/05

      • OK, I see. So you call tag() and compare the results to the treebanked tokenized sentences. That’s, basically, the same thing that I have done. So, the remaining question is that I still got 94.32% accuracy for your tagger instead of 96.8%. What am I doing wrong then?

        Comment by Vsevolod (@vseloved) | 2014/07/05

      • I hought you said S&L was being tokenized as S & L for you?

        Comment by honnibal | 2014/07/06

      • Yes, that’s true. But you said that you evaluate th etagger by giving raw strings to tag() function. I did the same. I assume, you should get the same result. If not, then you either evaluate your tagger differently, or you use some other version of NLTK/NLTK models. That’s why I was asking for your evaluation procedure from the beginning.

        Comment by Vsevolod (@vseloved) | 2014/07/08

      • Well what I meant was that the strings aren’t POS tagged. They still have the PTB tokenization and sentence boundary detection.

        Comment by honnibal | 2014/07/08

      • OK, but the situation with “S&L” in the sentence “The_DT complex_JJ financing_NN plan_NN in_IN the_DT S&L_NN bailout_NN law_NN includes_VBZ raising_VBG $_$ 30_CD billion_CD from_IN debt_NN issued_VBN by_IN the_DT newly_RB created_VBN RTC_NNP ._.” is that I pass the strings tokenized as is (i.e. as in the treebank), but the tagger (using NLTK tokenizer) splits the word “S&L”. How do I tell the tagger to just use the original tokenization (i.e. split on spaces, I assume)?

        Comment by Vsevolod (@vseloved) | 2014/07/08

  12. =/

    Pass the flag tokenize=False

    This tells it to use the function lambda t: t.split() to tokenize the sentence, instead of the function nltk.word_tokenize.

    Comment by honnibal | 2014/07/09 | Reply

    • OK, thanks. Sorry, I’ve missed that part of the code.
      Anyway, now the result is 94.34%. Basically, the same as before. Studying the differenences I see, that “-” is tokenized as “:” while in the treebank I have it as “HYPH”. Counting “-” as properly tokenized I can get the number up to 95.55%. Alsmost exatcly the same number I got for my model (95.52%). Looks like we’ve converged, but not at 96.8, unfortunately. What do you think?

      Comment by Vsevolod (@vseloved) | 2014/07/09 | Reply

      • There are probably some other trivial differences, about how the brackets or punctuation are spelled. I don’t know.

        Comment by honnibal | 2014/07/09

      • The only other significant artifact I see is that “to” is always tagged as TO, while in the treebank there’s a lot of instances of to/IN. However, I would say that to/IN is a more proper tagging in those situations, although I’ve seen most of the taggers unconditionally mark it as TO.

        Comment by Vsevolod (@vseloved) | 2014/07/09

      • It sounds like you’ve got a different version of the treebank. You should re-train on your data. In the PTB3, which I trained on, “to” always receives the tag “TO”.

        Comment by honnibal | 2014/07/10

      • OK, I’ll try to do it. I’ve used the version of the treebank in OntonNotes 4 (section nw/wsj), assuming it is the same, but, maybe, it was updated. I’ll try to use the original PTB 3.
        At the same time this brings us to the question of how well we can trust the results of such evaluation if even minor differences in the training data cause rather substantial (>1%) differences in accuracy…

        Comment by Vsevolod (@vseloved) | 2014/07/10

      • OK, on PTB 3.7 I got 96.21% (or 96.50% if you ignore wrong tagging of ‘(‘ and ‘)’ – somehow they are not written as -LRB- and -RRB- like in Ontonotes, and I’m not sure what your tagger expects). Pretty close, but still not quite it. Maybe, you used yet another version of PTB?

        Comment by Vsevolod (@vseloved) | 2014/07/11

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: