Go client for Updown


What is Updown?

Over the weekend, I’ve been working on creating a Go client for updown.io. Updown lets you monitor websites and online services for an affordable price. Checks can be performed for HTTP, HTTPS, ICMP and a custom TCP connection down to every 30s, from 4 locations around the globe. They also offer status pages, like the one I use for Teen Quotes. I find the design of the application and status pages really slick. For all these reasons, I use Updown for personal and freelance projects.

A Go REST client

I think that it’s the first time I wrote a REST API client in Go, and I feel pretty happy. My inspiration for the package came from the Godo package, the Go library for DigitalOcean. It helped me start and structure my files, structures and functions.

The source code is available on GitHub, under the MIT license. Here is a small glance at what you can do with it.

package main

import (

func main() {
    // Your API key can be retrieved at https://updown.io/settings/edit
    client := updown.NewClient("your-api-key", nil)
    // List all checks
    checks, HTTPResponse, err := client.Check.List()
    // Finding a token by an alias
    token, err := client.Check.TokenForAlias("Google")
    // Downtimes for a check
    page := 1 // 100 results per page
    downs, HTTPResponse, err := client.Downtime.List(token, page)

Enjoying working with Go again

I particularly enjoyed working with Go again, after a few months without touching it. I really like the integration with Sublime Text, the fast compilation, static typing, the golint (linter for Go code, that even takes into account variable names and comments) and go fmt (automatic code formatting) commands. I knew and I experienced once again that developing with Go is fast and enjoyable. You rapidly end up with a code that is nice to read, tested and documented.


As always, feedback, pull-requests, or kudos are welcomed! I did not achieved 100% coverage as I was quite lazy and opted for integration tests, meaning tests actually hit the real Updown API when they are performed.

Multiple deploy keys on the same machine – GitHub: key already in use


Github does not let you use the same SSH key as a deploy key for several projects. Knowing this, you’ve got 2 choices: edit the configuration of your 1st project and say that this SSH key is not longer a deploy key or find another solution.

Deleting the deploy key of the existing project

To know what is the project associated with your deploy key, you can run the command ssh -T -ai ~/.ssh/id_rsa [email protected] (adjust the path to your SSH key if necessary). Github will then great you with something like:

Hi AntoineAugusti/foo-project! You've successfully authenticated, but GitHub does not provide shell access.

From this point, solving your problem is just a matter of going to the settings of this repository and removing the deploy key.

The alternative: generating other SSH keys

We are going to generate a SSH key for each repository, you’ll see it’s not too much trouble.

  • First, generate a new SSH key with a comprehensive name with the command ssh-keygen -t rsa -f ~/.ssh/id_vendor_foo-project -C https://github.com/vendor/foo-project (replace vendor and foo-project).
  • Edit your ~/.ssh/config file to map a fake subdomain to the appropriate SSH key. You will need to add the following content:
    Host vendor_foo-project.github.com
        Hostname github.com
        IdentityFile ~/.ssh/id_vendor_foo-project

    This code maps a fake Github’s subdomain to the root domain and say that when connecting to the fake subdomain, we should automatically use the previously created SSH key.

  • Add the newly created SSH public key as a deploy key to the repository of your choice
  • Clone your Git repository with the fake subdomain: instead of using the URL given by GitHub (git clone [email protected]:vendor/foo-project.git) you will use git clone git@vendor_foo-project.github.com:vendor/foo-project.git
  • From now on, running git pull will connect to GitHub with the appropriate SSH key and GitHub will not complain 🙂

If you’ve already cloned the Git repository before, you can always change the remote URL to the Git server by editing the file .git/config of your project.

Happy deploys!

My experience as a mentor for students


Mentor what?

For the last 3 months, I have been a mentor for a few students on OpenClassrooms. OpenClassrooms is a French MOOC platform, visited by 2.5M people each month and they currently offer more than 1000 courses. They focus on technology courses for now: web development, mobile development, networking, databases for example. A course can be composed of textual explanations, videos, quizzes, practical sessions…

Courses are free, but you can pay a monthly fee to become a “Premium Plus” student, and thanks to this you will have a weekly 45 minutes / 1 hour session with someone experienced (student, professional, teacher…) to help you achieve your goals: getting certifications, finding an internship or starting your career in web development for instance. As a mentor, your primary goal is not to teach a course. Instead, you’re here as a support for students: you can help them understand a difficult part of a course, give them additional exercises, share with them valuable resources, look at their code and do a basic code review.

Mathieu Nebra en séance de mentorat
Mathieu Nebra (co-founder of OpenClassrooms) in a mentoring session

About “my students”

As an engineering student in a well recognised school in France, I’m used to be surrounded by lucky people: they are intelligent, they have good grades and one day they will get an engineering degree. This means that they will have a job nearly no matter what, and a well payed one. At OpenClassrooms, this is very different: a fair amount of students have had difficulties (left school early, were not interested in their first years at university, did some small jobs here and there to pay the rent…) and now they are working hard to improve their life. Web development is a fantastic opportunity: you can learn it from home, you only need a computer (and a cheap one is perfectly okay) and you can find a lot of learning resources for free on the Internet. The job market is not too crowded, and there is a good chance that you can find a job in a local web agency if you know HTML5, CSS3, a PHP framework and some basic jQuery. No need to work long hours, to wake up during the night, to fight to find a part time job to pay your rent; you can make a living by typing text in a text editor.

It has been a very valuable experience for me to listen to people that had bad times, had troubles in their life and now are dedicated to get better, to learn stuff and they just need advices to achieve what they want.

I am a mentor, but I learn

I’m helping my students mostly around web technologies. And this means that I’m supposed to know a lot of stuff about HTML5 (canvas, you know it?), CSS3 (flexbox anyone?), naked PHP (good ol’ PDO API) and JavaScript. Clearly, this is not the case. I don’t even do web development on a monthly basis. At first, I was a bit worried: am I going to be able to remember how I did it, a few years ago? How can you do this feature without a framework? Can I still read a mix of HTML / CSS / PHP, all in the same file? I was surprised, but the answer was yes, and it was very interesting to witness how my brain can actually remember things I did years ago, and how fast I can retrieve this information (just by thinking or by doing the right Google query).

I was also surprised by how broad my role is. Sure, students have some difficulties understanding every aspect of oriented object principles, and I have to go over some concepts multiple times, but who doesn’t? What they really need is not a simple technical advisor. They need to hear from someone experienced that it is perfectly fine to not understand OOP in just 2 weeks, that it is fine to forget method names or to mix up language syntaxes when you write for the first time HTML, CSS, JavaScript and PHP during the same day.

They need to hear from someone that they are doing great, and to remember what they have learned during the last month or so. I found that it helps them a lot to keep a simple schedule somewhere: “for next week, I want to have done these sections from this course, and I need to start looking at this also”. When you look back, they are happy to see that indeed they have finished and done successfully quizzes / activities for multiple courses recently. It is a tremendous achievement for students to know that they have learned something, that they are actually getting somewhere and that their knowledge is growing.

What next?

So far, it has been an incredible experience and I think I have learned a lot, and I do hope that students have learned valuable things thanks to me. I am feeling good because I see that I can help people, I can give back to the community and I can share my passion with people that are interested and deeply motivated.

Sounds like something you want to do? Visit this page.

Testing an os.exit scenario in Golang


Today, I ran into an issue. I wanted to test that a function logged a fatal error when something bad happened. The problem with a fatal log message is that it calls os.Exit(1) after logging the message. As a result, if you try to test this by calling your function with the required arguments to make it fail, your test suite is just going to exit.

Suppose you want to test something like this:

package foo

import (

func Crashes(i int) {
  if i == 42 {
    log.Fatal("It crashes because you gave the answer")

Well, this is not so easy as explained before. It turns out that the solution is to start a subprocess to test that the function crashes. The subprocess will exit, but not the main test suite. This is explained in a talk about testing techniques given in 2014 by Andrew Gerrand. If you want to check that the fatal message is something specific, you can inspect the standard error by using the os.exec package. Finally, the code to test the crashing part of the previous function would be the following:

package foo

import (

func TestCrashes(t *testing.T) {
  // Only run the failing part when a specific env variable is set
  if os.Getenv("BE_CRASHER") == "1" {

  // Start the actual test in a different subprocess
  cmd := exec.Command(os.Args[0], "-test.run=TestCrashes")
  cmd.Env = append(os.Environ(), "BE_CRASHER=1")
  stdout, _ := cmd.StderrPipe()
  if err := cmd.Start(); err != nil {

  // Check that the log fatal message is what we expected
  gotBytes, _ := ioutil.ReadAll(stdout)
  got := string(gotBytes)
  expected := "It crashes because you gave the answer"
  if !strings.HasSuffix(got[:len(got)-1], expected) {
    t.Fatalf("Unexpected log message. Got %s but should contain %s", got[:len(got)-1], expected)

  // Check that the program exited
  err := cmd.Wait()
  if e, ok := err.(*exec.ExitError); !ok || e.Success() {
    t.Fatalf("Process ran with err %v, want exit status 1", err)

Not so readable, definitely feels like a hack, but it does the job.

Limit the number of goroutines running at the same time


Recently, I was working on package that was doing network requests inside goroutines and I encountered an issue: the program was really fast to finish, but the results were awful. This was because the number of goroutines running at the same time was too high. As a result, the network was congested, too many sockets were opened on my laptop and the final performance was degraded: requests were slow or failing.

In order to keep the network healthy while maintaining some concurrency, I wanted to limit the number of goroutines making requests at the same time. Here is a sample main file to illustrate how you can control the maximum number of goroutines that are allowed to run concurrently.

package main

import (

// Fake a long and difficult work.
func DoWork() {
	time.Sleep(500 * time.Millisecond)

func main() {
	maxNbConcurrentGoroutines := flag.Int("maxNbConcurrentGoroutines", 5, "the number of goroutines that are allowed to run concurrently")
	nbJobs := flag.Int("nbJobs", 100, "the number of jobs that we need to do")

	// Dummy channel to coordinate the number of concurrent goroutines.
	// This channel should be buffered otherwise we will be immediately blocked
	// when trying to fill it.
	concurrentGoroutines := make(chan struct{}, *maxNbConcurrentGoroutines)
	// Fill the dummy channel with maxNbConcurrentGoroutines empty struct.
	for i := 0; i < *maxNbConcurrentGoroutines; i++ {
		concurrentGoroutines <- struct{}{}

	// The done channel indicates when a single goroutine has
	// finished its job.
	done := make(chan bool)
	// The waitForAllJobs channel allows the main program
	// to wait until we have indeed done all the jobs.
	waitForAllJobs := make(chan bool)

	// Collect all the jobs, and since the job is finished, we can
	// release another spot for a goroutine.
	go func() {
		for i := 0; i < *nbJobs; i++ {
			// Say that another goroutine can now start.
			concurrentGoroutines <- struct{}{}
		// We have collected all the jobs, the program
		// can now terminate
		waitForAllJobs <- true

	// Try to start nbJobs jobs
	for i := 1; i <= *nbJobs; i++ {
		fmt.Printf("ID: %v: waiting to launch!\n", i)
		// Try to receive from the concurrentGoroutines channel. When we have something,
		// it means we can start a new goroutine because another one finished.
		// Otherwise, it will block the execution until an execution
		// spot is available.
		fmt.Printf("ID: %v: it's my turn!\n", i)
		go func(id int) {
			fmt.Printf("ID: %v: all done!\n", id)
			done <- true

	// Wait for all jobs to finish

This file is available as a gist on GitHub if you find it more convenient.

Sample runs

For the command time go run concurrent.go -nbJobs 25 -maxNbConcurrentGoroutines 10:

ID: 1: waiting to launch!
ID: 1: it's my turn!
ID: 2: waiting to launch!
ID: 2: it's my turn!
ID: 3: waiting to launch!
ID: 3: it's my turn!
ID: 4: waiting to launch!
ID: 4: it's my turn!
ID: 5: waiting to launch!
ID: 5: it's my turn!
ID: 6: waiting to launch!
ID: 6: it's my turn!
ID: 7: waiting to launch!
ID: 7: it's my turn!
ID: 8: waiting to launch!
ID: 8: it's my turn!
ID: 9: waiting to launch!
ID: 9: it's my turn!
ID: 10: waiting to launch!
ID: 10: it's my turn!
ID: 11: waiting to launch!
ID: 1: all done!
ID: 9: all done!
ID: 11: it's my turn!
ID: 12: waiting to launch!
ID: 12: it's my turn!
ID: 7: all done!
ID: 13: waiting to launch!
ID: 5: all done!
ID: 13: it's my turn!
ID: 14: waiting to launch!
ID: 4: all done!
ID: 14: it's my turn!
ID: 8: all done!
ID: 15: waiting to launch!
ID: 15: it's my turn!
ID: 16: waiting to launch!
ID: 16: it's my turn!
ID: 10: all done!
ID: 17: waiting to launch!
ID: 2: all done!
ID: 17: it's my turn!
ID: 18: waiting to launch!
ID: 18: it's my turn!
ID: 3: all done!
ID: 19: waiting to launch!
ID: 6: all done!
ID: 19: it's my turn!
ID: 20: waiting to launch!
ID: 20: it's my turn!
ID: 21: waiting to launch!
ID: 20: all done!
ID: 16: all done!
ID: 17: all done!
ID: 12: all done!
ID: 21: it's my turn!
ID: 19: all done!
ID: 11: all done!
ID: 14: all done!
ID: 18: all done!
ID: 15: all done!
ID: 13: all done!
ID: 22: waiting to launch!
ID: 22: it's my turn!
ID: 23: waiting to launch!
ID: 23: it's my turn!
ID: 24: waiting to launch!
ID: 24: it's my turn!
ID: 25: waiting to launch!
ID: 25: it's my turn!
ID: 24: all done!
ID: 21: all done!
ID: 22: all done!
ID: 25: all done!
ID: 23: all done!
0,28s user 0,05s system 18% cpu 1,762 total

For the command time go run concurrent.go -nbJobs 10 -maxNbConcurrentGoroutines 1:

ID: 1: waiting to launch!
ID: 1: it's my turn!
ID: 2: waiting to launch!
ID: 1: all done!
ID: 2: it's my turn!
ID: 3: waiting to launch!
ID: 2: all done!
ID: 3: it's my turn!
ID: 4: waiting to launch!
ID: 3: all done!
ID: 4: it's my turn!
ID: 5: waiting to launch!
ID: 4: all done!
ID: 5: it's my turn!
ID: 6: waiting to launch!
ID: 5: all done!
ID: 6: it's my turn!
ID: 7: waiting to launch!
ID: 6: all done!
ID: 7: it's my turn!
ID: 8: waiting to launch!
ID: 7: all done!
ID: 8: it's my turn!
ID: 9: waiting to launch!
ID: 8: all done!
ID: 9: it's my turn!
ID: 10: waiting to launch!
ID: 9: all done!
ID: 10: it's my turn!
ID: 10: all done!
0,32s user 0,03s system 6% cpu 5,274 total

Questions? Feedback? Hit me on Twitter @AntoineAugusti

Openness for engineering teams


As a student, I am quite often looking at companies to see what they are doing, to understand the market and discover trends. As an engineering student, I am on the lookout for technical content, written by engineers. I discovered recently that I value a lot openness for engineering teams. Being open can be done in different ways:

  • Having a technical blog. You can understand this in multiple ways. First, you can have a blog where you talk about new features, new releases of your API / SDK. This one is quite common. The second one is really rare and very valuable to me: you talk about your engineering process, your hiring process, you share reports of outages. If you have open source projects, you have a blog post to let the technical community know about it.
  • Involvement in communities. You can be involved in communities in multiple ways: regularly sending members of your team to local meetups (not just attending if you can. Presenting and volunteering are awesome), being visible in conferences, giving explicit credit to open source solutions you are using (or giving money to them if you can afford to), host hackathons or hack days at your office. Be explicit about causes you care about and defend them.
  • Open source. Whether you contribute to open source projects or you open source some of your projects, involvement in the community is a great way to gain some exposure, let people know which technologies you are using and giving back to the community.

An update to the Joel Test?

Maybe some of these points will be in an updated “Joel Test” in the future, even if some people already say that it is partially antiquated. Personally, I would add the following questions to an updated version of the Joel Test:

  • Do you support developer education by attending conferences, purchasing books (or something equivalent)?
  • Do you have a simple, documented process to adopt new tools your team uses?
  • Do you have an engineering blog where you talk about your processes, ideas, beliefs and failures?

You can’t have it all

Being able to answer “Yes” to every questions above seems fairly difficult, and really impossible for small engineering teams. If your company is 1 year old and you are 2 engineers, you cannot put all these things in place. But as they say, “practice makes perfect”, so try to keep these goals in mind. Giving an awesome work environment to your engineers will make them productive, happy to work and so much more! Great engineering teams attract great engineers.

Developing and deploying a modulus checking API


Following my latest post about a Go package to validate UK bank account numbers, I wanted to offer a public API to let people check if a UK bank account number is valid or not. I know that offering a Go package is not ideal for everyone because for the moment Go is not everywhere in the tech ecosystem, and it’s always convenient to have an API you can send requests to, especially in a frontend context. My goal was to offer a JSON API, supporting authentication thanks to a HTTP header and with rate limits. With this, in the future you could adapt rate limits to some API keys, if you want to allow a larger amount of requests for some clients.

Packages I used

I wanted to give cloudflare/service a go because it lets you build quickly JSON APIs with some default endpoints for heartbeat, version information, statistics and monitoring. I used etcinit/speedbump to offer the rate limiting functionality and it was very easy to use. Note that the rate limiting functionality requires a Redis server to store request counts. Finally, I used the famous codegangsta/negroni to create middlewares to handle API authentication and rate limits and keeping my only controller relatively clean.

Deploying behind Nginx

My constraints were the following:

  • The API should only be accessible via HTTPS and HTTP should redirect to HTTPS.
  • The Golang server should run on a port > 1024 and the firewall will block access to everything but ports 22, 80 and 443
  • The only endpoints that should be exposed to the public are /verify, /version and /heartbeat. Statistics and monitoring should be accessible by administrators on localhost through HTTP

I ended up with this Nginx virtual host to suit my needs, I’m not sure if it can be simpler:

geo $is_localhost {
  default 0; 1;

server {
    listen 80;
    listen 443 ssl;

    server_name modulus.antoine-augusti.fr localhost.antoine-augusti.fr;

    ssl_certificate /etc/nginx/ssl/modulus.antoine-augusti.fr.crt;
    ssl_certificate_key /etc/nginx/ssl/modulus.antoine-augusti.fr.key;

   if ($is_localhost) {
      set $test A;

    if ($scheme = http) {
      set $test "${test}B";
    # Redirect to HTTPS if not connecting from localhost
    if ($test = B) {
      return 301 https://$server_name$request_uri;
    # Only the following endpoints are accessible to people not on localhost
    location ~ ^/(verify|heartbeat|version)  {
      include sites-available/includes/dispatch-golang-server;

    # Default case
    location / {
      # Not on localhost? End of game
      if ($is_localhost = 0) {
        return 403;
      # Forward request for people on localhost
      include sites-available/includes/dispatch-golang-server;

And for sites-available/includes/dispatch-golang-server:

proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $remote_addr;
proxy_set_header Host $host;

With this, I can still access the reserved endpoints by opening a SSH tunnel first with ssh -L4242: [email protected] and going to http://localhost.antoine-augusti.fr:4242/stats after.

Note that the Golang server is running on port 8080 and it should be monitored by Supervisor or whatever you want to use.

Grabbing the code and a working example

First of all, the API is available on GitHub under the MIT license so that you can deploy and adapt it yourself. If you want to test it first, you can use the API key foo against the base domain https://modulus.antoine-augusti.fr. Here is a cURL call for the sake of the example:

curl -H "Content-Type: application/json" -H "Api-Key: foo" -X POST -d '{"sort_code": "308037", "account_number": "12345678"}' https://modulus.antoine-augusti.fr/verify

Note that this API key is limited to 5 requests per minute. You’ve been warned 🙂 Looking for more requests per month or SLA, drop me a line.

Go challenge: validating UK bank account numbers


As I was reading through the SEPA specification, I found that it was not that simple to check if a UK bank account number was valid or not. If you’re not familiar with UK banks, they don’t use IBAN to transfer money within the UK, but a combination of a sort code and an account number. A sort code identifies the bank’s branch and each account has got an account number. A sort code is a 6 digits number and an account number can be between 6 and 11 digits, but most of them are 8 digits long.

For example, here is a valid UK bank account:

  • Sort code: 107999
  • Account number: 88837491

Algorithms to check if a UK bank account is valid

A very common way to check if a number (bank account, credit card, parking ticket…) is valid, is to apply a modulus algorithm. You perform an operation on each digit (addition, multiplication by a weight, substitution…), when you reach the end you divide by a specific number and you check that the remainder of the division is equal to something. Seems easy, right? Well, this is not that simple for UK bank accounts. In fact, if you want to go through the official specification on the Vocalink website, you will see that they use 2 algorithms, but they have also 15 exceptions to take into account (and some of them are weird or tricky to handle!). You will need to adapt the way you compute the modulus value according to a weight table also.

From the specification to a package

Reading the specification was interesting, but what really motivated me to code a Go package to solve this problem was the fact that test cases where provided in the specification! What a dream: the specification offers you 34 test cases, and they cover nearly all the exceptions. I jumped on the opportunity, it’s not that often that you are offered with a way to check that what you have done is actually right. In fact, I followed a Test Driven Developemnt aproach and it really guided me during the development and especially the refactoring.

Getting the code

The code is available on GitHub under the MIT license and should be well documented and tested. As always, pull requests and bug reports are welcome!

Here is an example:

package main

import (


func main() {
    // Read the modulus weight table and the sorting
    // code substitution table from the data folder
    parser := parsers.CreateFileParser()

    // The resolver handles the verification of the validity of
    // bank accounts according to the data obtained by the parser
    resolver := resolvers.NewResolver(parser)

    // This helper method handles special cases for
    // bank accounts from:
    // - National Westminster Bank plc (10 or 11 digits with possible presence of dashes, for account numbers)
    // - Co-Operative Bank plc (10 digits for account numbers)
    // - Santander (9 digits for account numbers)
    // - banks with 6 or 7 digits for account numbers
    bankAccount := models.CreateBankAccount("089999", "66374958")

    // Check if the created bank account is valid against the rules

Continuous integration and code coverage in Golang


It took me some time to find the right setup and the right tools to achieve something not that complicated: continuous integration and coverage reports for Golang projects hosted on GitHub.

I’m happy to share my configuration with you, hopefully it will save you some time. I’m using Travis CI for the continuous integration platform and Codecov for code coverage reports. Both are free and easy to setup: you can get just log in using your GitHub account, you will be up and running in under 5 minutes.

Here is the Travis file (.travis.yml) I use:

language: go
  - go get golang.org/x/tools/cmd/vet
  - go get github.com/modocache/gover
  - go get github.com/vendor/package/...
  # Vet examines Go source code and reports suspicious construct
  - go vet github.com/vendor/package...
  # Run the unit tests suite
  - go test -v ./...
  # Collect coverage reports
  - go list -f '{{if len .TestGoFiles}}"go test -coverprofile={{.Dir}}/.coverprofile {{.ImportPath}}"{{end}}' ./... | xargs -i sh -c {}
  # Merge coverage reports
  - gover . coverprofile.txt
  # Send coverage reports to Codecov
  - bash < (curl -s https://codecov.io/bash) -f coverprofile.txt

Replace github.com/vendor/package with your GitHub URL and you're good to go! You will be protected against yourself or contributors for your package. Unit tests will not break and coverage will not decrease. Or at least you will know when it happens!

Bonus: fancy badges

I like to put at the beginning of every README file a few information:

  • The status of the latest build (green is reassuring)
  • The software license, so that people immediately know if it's okay to use it for their project
  • A link to the GoDoc website, for documentation
  • The percentage of code covered by unit tests

If you want to do the same, here is what you can write at the very top of your README.md file:

# Travis CI for the master branch
[![Travis CI](https://img.shields.io/travis/vendor/package/master.svg?style=flat-square)](https://travis-ci.org/vendor/package)
# Note that this is for the MIT license and it expects a LICENSE.md file
[![Software License](https://img.shields.io/badge/License-MIT-orange.svg?style=flat-square)](https://github.com/vendor/package/blob/master/LICENSE.md)
# Link to GoDoc
# Codecov for the master branch
[![Coverage Status](http://codecov.io/github/vendor/package/coverage.svg?branch=master)](http://codecov.io/github/vendor/package?branch=master)

One more time, don’t forget to replace vendor/package (even in URLs) with your details and you’re good to go!


Head to AntoineAugusti/colors to see what it looks like.

Happy coding!

Word segmentation library in Golang


I’ve been into Golang lately, and today I’m glad to announce my second open source project in Golang, following the feature flags API. My second package is all about word segmentation.

What is the word segmentation problem?

Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like thisisatest. Humans can immediately identify that the correct phrase should be this is a test. But for machines, this is a tricky problem.

An approach to this problem

A basic idea would be to use a dictionary, and then to try to split words if the current chunk of letters is a valid word. But then you run into issues with sentences like peanutbutter that you will split with this approach as pea nut butter instead of peanut butter.

The idea was to take advantage of frequencies of words in a corpus. This is where the concept of a n-gram is used. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.

For example, this is an extract of some unigrams in a corpus composed of 1,024,908,267,229 words distributed by the Linguistic Data Consortium.

used 421438139
go 421086358
b 419765694
work 419483948
last 417601616
most 416210411
music 414028837
buy 410780176
data 406908328
make 405084642
them 403000411
should 402028056

Using unigrams and bigrams, we can score an arrangement of words. This is what is done in the score method for example.

Concurrency and channels

This was also a great opportunity for me to work with channels, because some parts of the program can be run in parallel. I’m just starting to work around goroutines and channels, but I really like it!

Take a look at the source code and the documentation on GitHub: github.com/AntoineAugusti/wordsegmentation