Computational Linguistics

Demystifying NLP

A good POS tagger in about 200 lines of Python

Up-to-date knowledge about natural language processing is mostly locked away in academia. And academics are mostly pretty self-conscious when we write. We’re careful. We don’t want to stick our necks out too much. But under-confident recommendations suck, so here’s how to write a good part-of-speech tagger.

There are a tonne of “best known techniques” for POS tagging, and you should ignore the others and just use Averaged Perceptron.

  • You should use two tags of history, and features derived from the Brown word clusters distributed here.

    If you only need the tagger to work on carefully edited text, you should use case-sensitive features, but if you want a more robust tagger you should avoid them because they’ll make you over-fit to the conventions of your training domain. Instead, features that ask “how frequently is this word title-cased, in a large sample from the web?” work well. Then you can lower-case your comparatively tiny training corpus.
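    A rough sketch of that kind of feature (the counts, bucket thresholds, and function names here are all made up for illustration, not taken from the tagger):

```python
from collections import Counter

def titlecase_rate(word, titled_counts, total_counts):
    '''Fraction of occurrences of `word` (case-folded) that were
    title-cased in a large unlabelled sample.'''
    total = total_counts[word.lower()]
    if total == 0:
        return 0.0
    return titled_counts[word.lower()] / float(total)

def case_feature(word, titled_counts, total_counts):
    '''Bucket the rate, so the tagger sees a small, robust feature set.'''
    rate = titlecase_rate(word, titled_counts, total_counts)
    if rate > 0.9:
        return 'case:mostly-title'
    elif rate > 0.5:
        return 'case:often-title'
    return 'case:rarely-title'

# Counts gathered from a (tiny, hypothetical) unlabelled sample
sample = ['The', 'cat', 'sat', 'on', 'the', 'mat', 'in', 'London', 'London']
titled = Counter(w.lower() for w in sample if w.istitle())
totals = Counter(w.lower() for w in sample)
```

    With the case information carried by the bucketed feature, the training corpus itself can be safely lower-cased.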

    For efficiency, you should figure out which frequent words in your training data have unambiguous tags, so you don’t have to do anything but output their tags when they come up. About 50% of the words can be tagged that way.
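    A sketch of that single-tag lookup (the frequency and ambiguity thresholds here are illustrative):

```python
from collections import defaultdict

def make_tagdict(sentences, freq_thresh=20, ambiguity_thresh=0.97):
    '''Map each frequent, almost-unambiguous word straight to its tag,
    so the statistical model never has to see it.'''
    counts = defaultdict(lambda: defaultdict(int))
    for words, tags in sentences:
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    tagdict = {}
    for word, tag_freqs in counts.items():
        tag, mode = max(tag_freqs.items(), key=lambda item: item[1])
        n = sum(tag_freqs.values())
        # Only trust words we've seen often, with one dominant tag
        if n >= freq_thresh and float(mode) / n >= ambiguity_thresh:
            tagdict[word] = tag
    return tagdict
```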

    And unless you really, really can’t do without an extra 0.1% of accuracy, you probably shouldn’t bother with any kind of search strategy — you should just use a greedy model.

    If you do all that, you’ll find your tagger easy to write and understand, and an efficient Cython implementation will perform as follows on the standard evaluation, 130,000 words of text from the Wall Street Journal:

    Tagger      Accuracy   Time (130k words)
    CyGreedyAP  97.1%      4s

    The 4s includes initialisation time — the actual per-token speed is high enough to be irrelevant; it won’t be your bottleneck.

    It’s tempting to look at 97% accuracy and say the accuracy is similarly a non-issue, but that’s not true. My parser is about 1% more accurate if the input has hand-labelled POS tags, and the taggers all perform much worse on out-of-domain data. Unfortunately accuracies have been fairly flat for the last ten years. That’s why my recommendation is to just use a simple and fast tagger that’s roughly as good.

    The thing is though, it’s very common to see people using taggers that aren’t anywhere near that good! For an example of what a non-expert is likely to use, these were the two taggers wrapped by TextBlob, a new Python API that I think is quite neat:

    Tagger    Accuracy   Time (130k words)
    NLTK      94.0%      3m56s
    Pattern   93.5%      26s

    Both Pattern and NLTK are very robust and beautifully well documented, so the appeal of using them is obvious. But Pattern’s algorithms are pretty crappy, and NLTK carries tremendous baggage around in its implementation, because of its massive framework and its double duty as a teaching tool.

    As a stand-alone tagger, my Cython implementation is needlessly complicated — it was written for my parser. So today I wrote a 200-line version of my recommended algorithm for TextBlob. It gets:

    Tagger      Accuracy   Time (130k words)
    PyGreedyAP  96.8%      12s

    I traded some accuracy and a lot of efficiency to keep the implementation simple. Here’s a far-too-brief description of how it works.

    Averaged perceptron

    POS tagging is a “supervised learning problem”. You’re given a table of data, and you’re told that the values in the last column will be missing during run-time. You have to find correlations from the other columns to predict that value.

    So for us, the missing column will be “part of speech at word i”. The predictor columns (features) will be things like “part of speech at word i-1”, “last three letters of word at i+1”, etc.

    First, here’s what prediction looks like at run-time:

    
      def predict(self, features):
          '''Dot-product the features and current weights and return the best class.'''
          scores = defaultdict(float)
          for feat in features:
              if feat not in self.weights:
                  continue
              weights = self.weights[feat]
              for clas, weight in weights.items():
                  scores[clas] += weight
          # Do a secondary alphabetic sort, for stability
          return max(self.classes, key=lambda clas: (scores[clas], clas))
    
    

    Earlier I described the learning problem as a table, with one of the columns marked as missing-at-runtime. For NLP, our tables are always exceedingly sparse. You have columns like “word i-1=Parliament”, which is almost always 0. So our “weight vectors” can pretty much never be implemented as vectors. Map-types are good though — here we use dictionaries.

    The input data, features, is a set with a member for every non-zero “column” in our “table” — every active feature. Usually this is actually a dictionary, to let you set values for the features. But here all my features are binary present-or-absent type deals.

    The weights data-structure is a dictionary of dictionaries, that ultimately associates feature/class pairs with some weight. You want to structure it this way instead of the reverse because of the way word frequencies are distributed: most words are rare, frequent words are very frequent.
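    Concretely, the structure looks something like this (feature names and weights are made up):

```python
from collections import defaultdict

# weights: feature -> {class: weight}
weights = {
    'i suffix+ing': {'VBG': 2.0, 'NN': -1.0},
    'i-1 tag+DT':   {'NN': 3.0, 'JJ': 1.0},
    'bias':         {'NN': 1.0, 'VB': 0.5},
}

def score(features, weights):
    '''Sum the class weights of every active feature, skipping
    features we never saw in training.'''
    scores = defaultdict(float)
    for feat in features:
        for clas, weight in weights.get(feat, {}).items():
            scores[clas] += weight
    return scores
```

    Scoring touches only the handful of active features, and for each feature only the classes that earned a non-zero weight; that is why the feature-first nesting pays off.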

    Learning the weights

    Okay, so how do we get the values for the weights? We start with an empty weights dictionary, and iteratively do the following:

    1. Receive a new (features, POS-tag) pair
    2. Guess the value of the POS tag given the current “weights” for the features
    3. If guess is wrong, add +1 to the weights associated with the correct class for these features, and -1 to the weights for the predicted class.

    It’s one of the simplest learning algorithms. Whenever you make a mistake, increment the weights for the correct class, and penalise the weights that led to your false prediction. In code:

    
      def train(self, nr_iter, examples):
          '''Run nr_iter passes over the examples, updating on mistakes.
          Assumes self.weights maps each feature to a dictionary of
          class weights (e.g. a defaultdict of defaultdict(float)).'''
          for i in range(nr_iter):
              for features, true_tag in examples:
                  guess = self.predict(features)
                  if guess != true_tag:
                      for f in features:
                          self.weights[f][true_tag] += 1
                          self.weights[f][guess] -= 1
              random.shuffle(examples)
    
    

    If you iterate over the same example this way, the weights for the correct class would have to come out ahead, and you’d get the example right. If you think about what happens with two examples, you should be able to see that it will get them both right unless the features are identical. In general the algorithm will converge so long as the examples are linearly separable, although that doesn’t matter for our purpose.

    Averaging the weights

    We need to do one more thing to make the perceptron algorithm competitive. The problem with the algorithm so far is that if you train it twice on slightly different sets of examples, you end up with really different models. It doesn’t generalise that smartly. And the problem is really in the later iterations — if you let it run to convergence, it’ll pay lots of attention to the few examples it’s getting wrong, and mutate its whole model around them.

    So, what we’re going to do is make the weights more ‘sticky’ — give the model less chance to ruin all its hard work in the later rounds. And we’re going to do that by returning the averaged weights, not the final weights.

    I doubt there are many people who are convinced that’s the most obvious solution to the problem, but whatever. We’re not here to innovate, and this way is time tested on lots of problems. If you have another idea, run the experiments and tell us what you find. Actually I’d love to see more work on this, now that the averaged perceptron has become such a prominent learning algorithm in NLP.

    Okay. So this averaging. How’s that going to work? Note that we don’t want to just average after each outer-loop iteration. We want the average of all the values — from the inner loop. So if we have 5,000 examples, and we train for 10 iterations, we’ll average across 50,000 values for each weight.

    Obviously we’re not going to store all those intermediate values. Instead, we’ll track an accumulator for each weight, and divide it by the number of iterations at the end. Again: we want the average weight assigned to a feature/class pair during learning, so the key component we need is the total weight it was assigned. But we want to be careful about how we compute that accumulator, too. On almost any instance, we’re going to see a tiny fraction of active feature/class pairs. All the other feature/class weights won’t change. So we shouldn’t have to go back and add the unchanged values to our accumulators on every iteration, like chumps.

    Since we’re not chumps, we’ll make the obvious improvement. We’ll maintain another dictionary that tracks how long each weight has gone unchanged. Now when we do change a weight, we can do a fast-forwarded update to the accumulator, for all those iterations where it lay unchanged.

    Here’s what a weight update looks like now that we have to maintain the totals and the time-stamps:

    
      def update(self, truth, guess, features):
          def upd_feat(c, f, v):
              # Fast-forward the accumulator over the iterations where
              # this weight sat unchanged, then apply the update
              nr_iters_at_this_weight = self.i - self._timestamps[f][c]
              self._totals[f][c] += nr_iters_at_this_weight * self.weights[f][c]
              self.weights[f][c] += v
              self._timestamps[f][c] = self.i

          self.i += 1
          for f in features:
              upd_feat(truth, f, 1.0)
              upd_feat(guess, f, -1.0)
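    At the end of training, the totals and time-stamps let you finish the averaging in a single pass. Here’s a sketch of that final step, written as a stand-alone function rather than the method it would be in practice:

```python
def average_weights(weights, totals, timestamps, i):
    '''Fast-forward every accumulator to update-count `i`, then
    replace each weight with its average over all i updates.'''
    for feat, feat_weights in weights.items():
        for clas, weight in feat_weights.items():
            total = totals[feat][clas]
            total += (i - timestamps[feat][clas]) * weight
            feat_weights[clas] = round(total / float(i), 3)
    return weights
```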
    
    

    Features and pre-processing

    The POS tagging literature has tonnes of intricate features sensitive to case, punctuation, etc. They help on the standard test-set, which is from Wall Street Journal articles from the 1980s, but I don’t see how they’ll help us learn models that are useful on other text.

    To help us learn a more general model, we’ll pre-process the data prior to feature extraction, as follows:

    • All words are lower cased;
    • Digits in the range 1800-2100 are represented as !YEAR;
    • Other digit strings are represented as !DIGITS.

    It would be better to have a module recognising dates, phone numbers, emails, hash-tags, etc. but that will have to be pushed back into the tokenization.
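    In code, the pre-processing amounts to something like this (a sketch matching the rules above):

```python
def normalize(word):
    '''Lower-case everything, and collapse digit strings into
    !YEAR / !DIGITS place-holders.'''
    if word.isdigit() and len(word) == 4 and 1800 <= int(word) <= 2100:
        return '!YEAR'
    elif word.isdigit():
        return '!DIGITS'
    return word.lower()
```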

    I played around with the features a little, and this seems to be a reasonable bang-for-buck configuration in terms of getting the development-data accuracy to 97% (where it typically converges anyway), and having a smaller memory foot-print:

        def _get_features(self, i, word, context, prev, prev2):
            '''Map tokens-in-contexts into a feature representation, implemented as a
            set. If the features change, a new model must be trained.'''
            def add(name, *args):
                features.add('+'.join((name,) + tuple(args)))
    
            features = set()
            add('bias') # This acts sort of like a prior
            add('i suffix', word[-3:])
            add('i pref1', word[0])
            add('i-1 tag', prev)
            add('i-2 tag', prev2)
            add('i tag+i-2 tag', prev, prev2)
            add('i word', context[i])
            add('i-1 tag+i word', prev, context[i])
            add('i-1 word', context[i-1])
            add('i-1 suffix', context[i-1][-3:])
            add('i-2 word', context[i-2])
            add('i+1 word', context[i+1])
            add('i+1 suffix', context[i+1][-3:])
            add('i+2 word', context[i+2])
            return features
    

    I haven’t added any features from external data, such as case frequency statistics from the Google Web 1T corpus. I might add those later, but for now I figured I’d keep things simple.

    What about search?

    The model I’ve recommended commits to its predictions on each word, and moves on to the next one. Those predictions are then used as features for the next word. There’s a potential problem here, but it turns out it doesn’t matter much. It’s easy to fix with beam-search, but I say it’s not really worth bothering. And it definitely doesn’t matter enough to adopt a slow and complicated algorithm like Conditional Random Fields.

    Here’s the problem. The best indicator for the tag at position, say, 3 in a sentence is the word at position 3. But the next-best indicators are the tags at positions 2 and 4. So there’s a chicken-and-egg problem: we want the predictions for the surrounding words in hand before we commit to a prediction for the current word. Here’s an example where search might matter:

    Their management plan reforms worked

    Depending on just what you’ve learned from your training data, you can imagine making a different decision if you started at the left and moved right, conditioning on your previous decisions, than if you’d started at the right and moved left.

    If that’s not obvious to you, think about it this way: “worked” is almost surely a verb, so if you tag “reforms” with that in hand, you’ll have a different idea of its tag than if you’d just come from “plan”, which you might have regarded as either a noun or a verb.

    Search can only help you when you make a mistake. It can prevent that error from throwing off your subsequent decisions, or sometimes your future choices will correct the mistake. And that’s why for POS tagging, search hardly matters! Your model is so good straight-up that your past predictions are almost always true. So you really need the planets to align for search to matter at all.

    And as we improve our taggers, search will matter less and less. Instead of search, what we should be caring about is multi-tagging. If we let the model be a bit uncertain, we can get over 99% accuracy assigning an average of 1.05 tags per word (Vadas et al, ACL 2006). The averaged perceptron is rubbish at multi-tagging though. That’s its big weakness. You really want a probability distribution for that.

    One caveat when doing greedy search, though. It’s very important that your training data model the fact that the history will be imperfect at run-time. Otherwise, it will be way over-reliant on the tag-history features. Because the Perceptron is iterative, this is very easy: during training, just use the tagger’s own guesses as the history features, instead of the gold-standard tags.

    Here’s the training loop for the tagger:

    
        def train(self, sentences, save_loc=None, nr_iter=5, quiet=False):
            '''Train a model from sentences, and save it at save_loc. nr_iter
            controls the number of Perceptron training iterations.'''
            self._make_tagdict(sentences, quiet=quiet)
            self.model.classes = self.classes
            for iter_ in range(nr_iter):
                c = 0; n = 0
                for words, tags in sentences:
                    # Reset the tag history at the start of each sentence
                    prev, prev2 = START
                    context = START + [self._normalize(w) for w in words] + END
                    for i, word in enumerate(words):
                        guess = self.tagdict.get(word)
                        if not guess:
                            feats = self._get_features(i, word, context, prev, prev2)
                            guess = self.model.predict(feats)
                            self.model.update(tags[i], guess, feats)
                        # Set the history features from the guesses, not the true tags
                        prev2 = prev; prev = guess
                        c += guess == tags[i]; n += 1
                random.shuffle(sentences)
                if not quiet:
                    print("Iter %d: %d/%d=%.3f" % (iter_, c, n, _pc(c, n)))
            self.model.average_weights()
            # Pickle as a binary file
            if save_loc is not None:
                cPickle.dump((self.model.weights, self.tagdict, self.classes),
                             open(save_loc, 'wb'), -1)
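    Loading a saved model back is the mirror image of that dump call. A sketch, using the stdlib `pickle` module (same format; the snippet above uses Python 2’s `cPickle`):

```python
import pickle

def save_tagger(path, weights, tagdict, classes):
    '''Mirror of the dump call in train().'''
    with open(path, 'wb') as f:
        pickle.dump((weights, tagdict, classes), f, -1)

def load_tagger(path):
    '''Restore the (weights, tagdict, classes) triple saved by train().'''
    with open(path, 'rb') as f:
        weights, tagdict, classes = pickle.load(f)
    return weights, tagdict, classes
```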
    
    

    Unlike the previous snippets, this one’s literal — I tended to edit the previous ones to simplify. So if they have bugs, hopefully that’s why!

    At the time of writing, I’m just finishing up the implementation before I submit a pull request to TextBlob. You can see the rest of the source here:

    https://github.com/sloria/textblob-aptagger/blob/master/textblob_aptagger/taggers.py

    https://github.com/sloria/textblob-aptagger/blob/master/textblob_aptagger/_perceptron.py

    A final comparison…

    Over the years I’ve seen a lot of cynicism about the WSJ evaluation methodology. The claim is that we’ve just been meticulously over-fitting our methods to this data. Actually the evidence doesn’t really bear this out. Mostly, if a technique is clearly better on one evaluation, it improves others as well. Still, it’s very reasonable to want to know how these tools perform on other text. So I ran the unchanged models over two other sections from the OntoNotes corpus:

    Tagger      WSJ    ABC    Web
    Pattern     93.5   90.7   88.1
    NLTK        94.0   91.5   88.4
    PyGreedyAP  96.8   94.8   91.8

    The ABC section is broadcast news, Web is text from the web (blogs etc — I haven’t looked at the data much).

    As you can see, the order of the systems is stable across the three comparisons, and the advantage of our Averaged Perceptron tagger over the other two is real enough. Actually the Pattern tagger does very poorly on out-of-domain text. It mostly just looks up the words, so it’s very domain dependent. I hadn’t realised it before, but it’s obvious enough now that I think about it.

    We can improve our score greatly by training on some of the foreign data. The technique described in this paper (Daume III, 2007) is the first thing I try when I have to do that.
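    The trick in that paper is “frustratingly easy”: make a shared copy and a domain-specific copy of every feature, and let the learner decide how to split the weight between them. A sketch (the `shared+` and domain prefixes are my own naming):

```python
def augment(features, domain):
    '''Daume III (2007)-style feature augmentation: each active feature
    gets a shared version plus a domain-tagged version.'''
    out = set()
    for f in features:
        out.add('shared+' + f)
        out.add(domain + '+' + f)
    return out
```

    Train on the union of in-domain and out-of-domain data with augmented features, and the model learns which cues transfer across domains and which don’t.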

    P.S. The format of these tables sucks. Probably I should move off wordpress.com to actual hosting, but in the meantime, any suggestions?


    2013/09/11 - Posted by | Uncategorized

  • 15 Comments

    1. […] Honnibal wrote a clear and detailed blog post about the Averaged Perception and his implementation here. For this reason, this post will focus on how to get and use the tagger without providing […]

      Pingback by Tutorial: State-of-the-art Part-of-Speech Tagging in TextBlob | spider's space | 2013/09/17

    2. One of the most readable, clear, and just best posts I’ve ever seen on CL. I wish the Internet was always this good.

      Comment by halgrimnlp | 2013/09/19

    3. First of all congratulations on the nice Python implementation for Average Perceptron and the clearly-written documentation!

      It is true that academics (computer linguists) are careful. Reasons I see is that accuracy scores are domain-dependent, and that training data is often licensed.

      Since the Brill tagger is a “first-generation” tagger, it is also true that more robust methods have been around for some time. If nothing else, the Brill Tagger is (generally accepted to be) public domain, making it compatible with Pattern’s BSD license. Since TronTagger (or PyGreedyAP) is now part of TextBlob (MIT, free for commercial use), did you check the license on your training data?

      Below are some tests I did, with accuracy scores and performance. I used WSJ section 20 and OANC (assorted sentences from written_1/journal/, 180,000 words). The high score for Brill on WSJ section 20 might be due to that it was trained on the same data, I don’t know the details.

      1) {word: tag} dictionary: 0.926 WSJ (1s), 0.888 OANC (1.5s)
      2) Brill tagger (Pattern): 0.965 WSJ (32s), 0.925 OANC (124s)
      3) Brill tagger (TextBlob): 0.965 WSJ (12s), 0.925 OANC (48s)
      4) TronTagger: 0.955 WSJ (6s), 0.912 OANC (27s)
      5) MBSP: 0.963 WSJ (82s), 0.944 OANC (920s)
      6) dictionary + SVM: 0.955 WSJ (1.5s), 0.930 OANC (3s)

      The SVM (6) was trained on OANC following your steps. Do you mind sharing your test data? It might be interesting to continue along this path to improve the Pattern taggers.

      Cheers!

      Comment by tomdesmedt | 2013/09/19

      • Hi Tom,

        Thanks for running these tests!

        The awkward thing about train/dev/test splits is that POS taggers usually use 00-18/19-21/22-24, and parsers use 02-21/24/23, for historical reasons. So if a tagger was trained as a pre-processor to a parser being tested in the standard way, the tagger will be trained on 20. So I think it’s probably safer to test on 22, 23 or 24.

        About that OANC corpus, I’ve just had a quick google and the pages are talking about the tags being automatically added. Is there a manually annotated subset?

        Finally, for the SVM model in 6, how did the tag history features work? Did you do some sort of boot-strapping, or did the problem turn out not to matter?

        Comment by honnibal | 2013/09/19

    4. OK you nerds, very impressive. I’m not a linguist (I thought CL stood for Craigslist) nor am I a programmer (I thought textblob really was an upcoming movie). My (should have been) fleeting curiosity to understand the “state and nature” of modern spell-check/correction/suggestion techniques has led me to your doorstep.

      I have smoke coming out of my ears from my brain overheating as I’ve reached the end of my breadcrumbs here… – in this blog and comment section. With posts and comments; and replys still dripping wet ink – from you kingpins of CL. So I say; “F** it.” I’ll just bite the bullet; admit I’m inferior in the face of greatness; throw in the towel on my self-guided research; and just humbly ask you guys to “dumb it down” for me; so I can quit chasing my tail.

      I’ll try to explain (using English); and after some well deserved ridicule, could you; would you; Please…enlighten me on the following paradigm.

      I’m interested in live spell-check/grammar/correction/suggestion methods used by modern smart devices, mostly for communication – but not limited to it. T9 seemed to have it nearly perfected 10 years ago; but has since vanished from current implementations, thanks to the phenomenal explosion of touchscreen devices. Touchscreen QWERTY (thanks Apple) obviously didn’t have any of the strict limitations imposed by the T9 numeric keypad translator . It’s a no-brainer; good riddance T9.

      The issue I have is: It seems like we deal with more spelling errors now; autosuggestions appear to be less accurate (more breadth), ultimately requiring more effort than was typical with the T9 predecessor I remember.

      So the conundrum is – The fully QWERTY touchscreen introduces the opportunity for more mistakes, which require more effort than the technically inferior T9 (quasi-wannabe-keyboard-Numeric-Keypad). Now granted; the QWERTY is superior (maybe perfect) if you type and spell flawlessly; but the touchscreen implementation almost guarantees less than perfect typing – which increases miss-spellings regardless of what a user “intended” to spell. This explains the greater breadth of the autosuggestion set, a result of “miss-keying” compounded by miss-spelling. Compared to T9, where miss-spellings typically were confined to a more limited set of alternatives, obviously this increased the likelihood of accurately suggesting the correct word.

      As far as “on-the-fly” autosuggestion/spell-checking/completion (is there an acronym for this bs?) is concerned, It would appear the textblob and NLP systems would fall short on accuracy because they rely on training data has been through a few F7 cycles before going to print at the WSJ. As far as the saying goes “Garbage in, garbage out” the training data cited should be nearly ideal regarding spelling and grammar. Fair enough. But what happens to the systems when forced to interpret “on the fly”; with the heavy reality of modern day messaging, complete with all of its miss-keyed miss-spellings; ebonicized ghetto grammar, acronyms, and the constantly evolving urban slang that surely drives 8th grade English Teachers to point of at least considering the purchase of an AR-15 (for education purposes of course).

      To their credit – the kids seem to have no issue communicating amongst themselves with a drastically expanded vernacular that propagates virally, and can morph overnight. Shwag just ain’t what it used to be, and you don’t even want to have to figure out what it means when a 10 year old uses ‘merf’ in a response to most anything. Who would have ever believed English

      I found much less information regarding spellcheck(); but interpret it to be a fairly robotic dictionary look-up, and seemingly less important in the larger context of challenges imposed by NLP algorithms. That being said; I’m very curious to know your thoughts on how the recent discovery of the phenomenon Typoglycemia is utilized by spell-check/completion algorithms; if it all?

      What is Typoglycemia. This – “I cdn’uolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg: the phaonmneel pweor of the hmuan mnid. Aoccdrnig to a rseearch taem at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Scuh a cdonition is arppoiatrely cllaed Typoglycemia .
      “Amzanig huh? Yaeh and you awlyas thguoht slpeling was ipmorantt.”

      It seems there is room for improvement. I have ideas to innovate in this dimension; but since I’m a non-programming NLP noob; I figure I’ll just ask you experts, rather than risking the effort of trying to become one; only to learn that my ideas are juvenile at best; stoopid at worst; and laughable at both.

      Can you offer some insight; tell me what I’ve missed; or just steer me in the right direction to find the “ultimate” real-time correction-suggestion engine? I’ve searched patents; programs; everything on wiki; and tried most aftermarket keyboard android apps. No doubt the QWERTY was an abominable, reprehensible; god-forsaken cruel joke and ANY remapping of characters could not degrade its usability. However, no matter how logical a key mapping is, the greatest strides in WPM will come from software that “gets” what we’re “saying” without testing our ability to spell, hunt, peck or correct.

      A little known factoid: Do you know what the single most common key on EVERY keyboard is??
      Backspace. (stretched right pinky). Now real quick; without playing through the movements tell me this: What 3 characters does your index finger type. Why?

      Ok, All done here, as you were.

      Comment by John Jacob Jingle-Heimer-Smith (Again) | 2013/09/20

      • Hi,

        Mobile text entry’s an active research area. I think it’s interesting, but I haven’t worked in it. If I were trying to improve a mobile text entry system, I doubt I’d look into part-of-speech tagging. I don’t think it would be very helpful.

        I think it’s primarily an HCI (Human Computer Interface) issue, more than a language modelling issue. Sure, a better language model will give you better spelling suggestions, all else being equal. But that’s a tiny factor compared to having better design.

        I came across this guy’s bibliography and found his ideas quite interesting: http://scholar.google.com.au/citations?user=7Z-F_AoAAAAJ&hl=en . I can’t say I’ve read any of it carefully, though.

        Comment by honnibal | 2013/09/20

    5. Very nice post & code!
      I just implemented a perceptron learner myself, and it’s interesting to compare to your implementation.
      Most things are done quite similarly — however, I don’t understand your remark
      “The weights data-structure is a dictionary of dictionaries, that ultimately associates feature/class pairs with some weight. You want to structure it this way instead of the reverse [...]”
      My data structure is “class -> (feature -> weight)”, and I don’t see (but would like to see) why the other way is more inefficient…
      thanks!

      Comment by Ben | 2013/10/12

      • Heya,

        Good question! That was a bad explanation for that issue — I tried to keep the discussion moving, and that’s really not the right way to think about that problem.

        Consider the prediction code, where the weights are calculated:

        https://github.com/sloria/TextBlob/blob/dev/text/_perceptron.py#L35

        What would this loop look like if we’d structured the weights the other way around? Well, what we’d be doing is looping through every class for every feature.

        Most features will have been seen fewer than 100 times in the training data. So it’s very unlikely they’ll have a non-zero weight for every class. Probably only 1-5 of the classes will have a non-zero weight. And if we’ve never seen the feature at all, we can skip the class-iteration entirely.

        So it’s not really about Zipf’s law, as such. I should rewrite that bit.

        Comment by honnibal | 2013/10/12

        • thanks for the additional explanation!

          Comment by Ben | 2013/10/12

    6. Though I don’t like Python, sometimes I get amazed by its brevity. If you were to implement the same thing in Java, it would probably take much longer.

      Comment by Ajinkya | 2014/01/16

    7. Excellent post, it inspired me to implement my own (https://github.com/cslu-nlp/PerceptronixPointNever). Two quick thoughts:

      In the averaged perceptron, with weights structured the way you have done them here, is there ultimately any reason to actually divide? Isn’t the normalizing divisor a constant, and if so, can’t we just use the summed weights at inference time?

      I ended up structuring my weights as a dictionary of class (tag) keys mapping to feature-weight key-value pairs (as @Ben is saying). I spent a lot of time arguing with myself about which made more sense but ultimately settled on this. If there’s any optimization I missed because of that, I don’t see it, but I’d be interested to know.

      Comment by Kyle Gorman | 2014/03/15

      • *blinks*

        How the fuck did I miss that all these years? Going to have to run the experiment just to check I’m not dreaming, but you’re surely right.

        Empirically, structuring the weights as feature-to-class is much better for this problem.

        We have 45 classes and let’s say 15 active features per token. If we iterate per class, we get back a dict and must make 15 lookups into it, most missing. If we iterate per feature, we can get back a sparse array — we know we need every value returned to us, we don’t have to look them up.

        A nice, easy optimisation would be to update the features data structure to map to a numpy sparse array on the inside instead of a dict.

        Comment by honnibal | 2014/03/15

    8. This awesome post gave me an idea for my master’s thesis topic. The goal is to tag (and maybe even parse) a non-standard English (learner) corpus. I just started digging into Python, and only have previous experience with web development, hence have very humble expectations of my own abilities to write a tagger/parser. But because of the complications naturally connected with the “non-standard” part, I am not sure I can reliably use any existing taggers (and, gosh, NLTK is massive!) and would probably need to invent my own tags to identify mistakes etc. I still haven’t quite figured out the learning algorithms though.

      Comment by Elvina | 2014/03/20

      • Thanks for the kind words.

        There’s a growing literature on processing learner English. My colleague at Macquarie Mark Dras has done some of the early work on this, with his student Jojo Wong. Have a look for their papers, and let me know if you have any questions.

        Cheers
        Matt.

        Comment by honnibal | 2014/03/21

