September 2015 – Antoine Augusti

Word segmentation library in Golang

I’ve been into Golang lately, and today I’m glad to announce my second open source Golang project, following the feature flags API. This package is all about word segmentation.

What is the word segmentation problem?

Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like thisisatest. Humans can immediately identify that the correct phrase should be this is a test. But for machines, this is a tricky problem.

An approach to this problem

A basic idea would be to use a dictionary and to split the text whenever the current chunk of letters is a valid word. But then you run into issues with inputs like peanutbutter, which this approach splits as pea nut butter instead of peanut butter.
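To make the failure mode concrete, here is a minimal sketch of that naive approach. It assumes a tiny hard-coded dictionary and a greedy strategy that cuts as soon as the current chunk is a valid word. The names naiveSplit and dictionary are illustrative only and are not taken from the package.

```go
package main

import (
	"fmt"
	"strings"
)

// dictionary is a tiny, hypothetical word list used only for this illustration.
var dictionary = map[string]bool{
	"pea": true, "nut": true, "peanut": true, "butter": true,
}

// naiveSplit cuts the input as soon as the current chunk is a dictionary word.
// Letters left over at the end that never formed a word are returned as a final chunk.
func naiveSplit(text string) []string {
	var words []string
	start := 0
	for end := 1; end <= len(text); end++ {
		if dictionary[text[start:end]] {
			words = append(words, text[start:end])
			start = end
		}
	}
	if start < len(text) {
		words = append(words, text[start:])
	}
	return words
}

func main() {
	fmt.Println(strings.Join(naiveSplit("peanutbutter"), " ")) // pea nut butter
}
```

Because pea is matched before peanut is ever considered, the greedy cut can never recover the intended split.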

A better idea is to take advantage of the frequencies of words in a corpus. This is where the concept of an n-gram comes in. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
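As a quick illustration of the definition, the sketch below extracts word-level n-grams from a sentence. The function name ngrams is an assumption made for this example, not something exported by the package.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns every contiguous sequence of n words, each joined by a space.
// It returns nil when n is not between 1 and len(words).
func ngrams(words []string, n int) []string {
	if n <= 0 || n > len(words) {
		return nil
	}
	out := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		out = append(out, strings.Join(words[i:i+n], " "))
	}
	return out
}

func main() {
	words := strings.Fields("this is a test")
	fmt.Printf("%q\n", ngrams(words, 1)) // ["this" "is" "a" "test"]
	fmt.Printf("%q\n", ngrams(words, 2)) // ["this is" "is a" "a test"]
}
```

With n = 1 you get unigrams and with n = 2 you get bigrams, the two kinds of n-grams used in the rest of this post.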

For example, this is an extract of some unigrams in a corpus composed of 1,024,908,267,229 words distributed by the Linguistic Data Consortium.
used 421438139
go 421086358
b 419765694
work 419483948
last 417601616
most 416210411
music 414028837
buy 410780176
data 406908328
make 405084642
them 403000411
should 402028056

Using unigrams and bigrams, we can score an arrangement of words. This is what the score method does, for example.
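To give a feel for how such a score can drive segmentation, here is a simplified sketch. It is not the package's actual score method. It assumes log-probabilities estimated from counts, a length-based penalty for unknown words, and a memoized search over every possible split. All counts except the corpus size are made up for illustration, and the names score, segment and result are hypothetical. Input is assumed to be lowercase ASCII.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// total is the corpus size quoted above; the counts below are invented for this example.
const total = 1024908267229.0

var unigrams = map[string]float64{
	"pea": 2e6, "nut": 8e6, "peanut": 6e6, "butter": 2e7,
}

var bigrams = map[string]float64{
	"peanut butter": 1.5e6,
}

// score returns log10 of the estimated probability of word given the previous word.
func score(word, prev string) float64 {
	if c, ok := bigrams[prev+" "+word]; ok && unigrams[prev] > 0 {
		return math.Log10(c / unigrams[prev])
	}
	if c, ok := unigrams[word]; ok {
		return math.Log10(c / total)
	}
	// Unknown word: penalize heavily, and more so the longer it is.
	return math.Log10(10 / (total * math.Pow(10, float64(len(word)))))
}

// result holds the best segmentation found for a piece of text and its score.
type result struct {
	score float64
	words []string
}

const maxWordLen = 24 // assumed upper bound on the length of a single word

// segment tries every first word, recurses on the remainder and keeps the best total score.
func segment(text, prev string, memo map[string]result) result {
	if text == "" {
		return result{}
	}
	key := prev + " " + text
	if r, ok := memo[key]; ok {
		return r
	}
	best := result{score: math.Inf(-1)}
	for i := 1; i <= len(text) && i <= maxWordLen; i++ {
		head, tail := text[:i], text[i:]
		rest := segment(tail, head, memo)
		if s := score(head, prev) + rest.score; s > best.score {
			best = result{score: s, words: append([]string{head}, rest.words...)}
		}
	}
	memo[key] = best
	return best
}

func main() {
	r := segment("peanutbutter", "", map[string]result{})
	fmt.Println(strings.Join(r.words, " ")) // peanut butter
}
```

With these counts, peanut butter beats pea nut butter because two fairly frequent words, linked by a known bigram, are more probable than three independent ones.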

Concurrency and channels

This was also a great opportunity for me to work with channels, because some parts of the program can run in parallel. I’m just starting to work with goroutines and channels, but I really like them!
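As a minimal sketch of the pattern, assume the independent parts are the parsing of the unigram and bigram data. That is an assumption, since the post does not say which parts run in parallel. Each source is parsed in its own goroutine and the results are collected over a channel. The tab-separated format, the bigram count and the names parse, loadAll and counts are all hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// counts is a named table of n-gram counts, e.g. "unigrams".
type counts struct {
	name string
	data map[string]int64
}

// parse reads "ngram<TAB>count" lines and sends the resulting table on out.
func parse(name, raw string, out chan<- counts) {
	data := make(map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(raw))
	for sc.Scan() {
		fields := strings.Split(sc.Text(), "\t")
		if len(fields) != 2 {
			continue // skip malformed lines
		}
		n, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue // skip lines whose count is not a number
		}
		data[fields[0]] = n
	}
	out <- counts{name: name, data: data}
}

// loadAll parses every source in its own goroutine and collects the results.
func loadAll(sources map[string]string) map[string]map[string]int64 {
	out := make(chan counts)
	for name, raw := range sources {
		go parse(name, raw, out)
	}
	tables := make(map[string]map[string]int64, len(sources))
	for range sources {
		c := <-out // blocks until one of the goroutines is done
		tables[c.name] = c.data
	}
	return tables
}

func main() {
	tables := loadAll(map[string]string{
		"unigrams": "used\t421438139\ngo\t421086358",
		"bigrams":  "peanut butter\t1500000",
	})
	fmt.Println(tables["unigrams"]["go"], len(tables["bigrams"])) // 421086358 1
}
```

Receiving exactly one value per source means loadAll only returns once every goroutine has finished, without needing a separate wait group.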

Take a look at the source code and the documentation on GitHub: github.com/AntoineAugusti/wordsegmentation

Feature flags API in Golang

What are feature flags?

Feature flags let you enable or disable some features of your application, for example when you’re facing unexpected traffic or when you want to let some users try a new feature you’ve been working on. They decouple feature release from code deployment, so that you can release features whenever you want, instead of whenever the code happens to ship.

With this package, you can enable access to a feature for:

specific user IDs

specific groups

a percentage of your user base

everyone

no one

And you can combine things! You can give access to a feature to users in the dev or admin groups and to users 1337 and 42 if you want to.
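The rules above could be modelled roughly as follows. This is a sketch under assumptions, not the package's real model. The Feature struct, its field names and the HasAccess method are hypothetical, and hashing the user ID with CRC32 to pick a percentage bucket is one common choice rather than a documented behaviour.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"strconv"
)

// Feature is a hypothetical model; the field names are illustrative.
type Feature struct {
	Enabled    bool     // true: everyone has access
	Users      []uint32 // specific user IDs
	Groups     []string // specific groups
	Percentage uint32   // 0-100, share of the user base
}

// HasAccess reports whether the user can use the feature.
// The rules combine with OR; the zero value of Feature grants access to no one.
func (f Feature) HasAccess(userID uint32, groups []string) bool {
	if f.Enabled {
		return true
	}
	for _, id := range f.Users {
		if id == userID {
			return true
		}
	}
	for _, allowed := range f.Groups {
		for _, g := range groups {
			if g == allowed {
				return true
			}
		}
	}
	// Hash the ID so the same user always lands in the same bucket (0-99).
	bucket := crc32.ChecksumIEEE([]byte(strconv.FormatUint(uint64(userID), 10))) % 100
	return bucket < f.Percentage
}

func main() {
	f := Feature{Groups: []string{"dev", "admin"}, Users: []uint32{1337, 42}}
	fmt.Println(f.HasAccess(42, nil))               // true: listed user
	fmt.Println(f.HasAccess(7, []string{"admin"}))  // true: member of an allowed group
	fmt.Println(f.HasAccess(7, []string{"sales"}))  // false: no rule matches
}
```

Hashing instead of drawing a random number keeps the percentage rollout stable: a given user either has the feature or does not, request after request.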

What I’ve learned

I guess it’s a rather complete project because it involves a storage layer (a key-value store, with bolt), some logic around a simple model (what is a feature? How do we control access to a feature?) and an HTTP layer (with the default HTTP server and gorilla/mux). Moreover, I’ve tried to write some tests, and it was really interesting to discover the “Go way” to do it!

Anyway, I’ve learned a lot and I’m fairly happy with the codebase, but if you spot anything that can be improved or that is wrong, please do get in touch with me (GitHub issues and tweets are perfect).
