Atelier Convivialité
- March 10, 2010 5 reasons why your company should be distributed
-
March 10, 2010
The secret of successful company blogging
Thoughtful article on how to run a company blog.
- March 9, 2010 Localize countries, languages and more
-
March 6, 2010
Sphinx Search and i18n
Web Translate It’s search feature is powered by Sphinx, an open-source full-text search engine. To make the programming a little easier, I use Thinking Sphinx, a Ruby library that connects Sphinx to Active Record.
We mostly use Sphinx to search for strings. We currently have more than 400,000 translations in the database (almost 500,000 when counting the different revisions), and this number is growing fast. Searching directly against the database would be painfully slow, and it would only get slower as the data grows.
Sphinx has been designed with indexing database content in mind: it scans the database and creates its own index, which it then uses to return search results quickly.
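To give a feel for why a prebuilt index is faster than scanning every row, here is a minimal Ruby sketch of an inverted index. It is only an illustration of the general idea, not how Sphinx is implemented internally; the `build_index` helper and the sample translations are made up for the example.

```ruby
# Toy inverted index: maps each lowercased word to the ids of the rows
# that contain it. Hypothetical helper, not part of Sphinx or Thinking Sphinx.
def build_index(rows)
  index = Hash.new { |hash, word| hash[word] = [] }
  rows.each do |id, text|
    # uniq avoids listing the same row twice for a repeated word
    text.downcase.split.uniq.each { |word| index[word] << id }
  end
  index
end

# Sample data, invented for the example.
translations = {
  1 => "Save changes",
  2 => "Discard changes",
  3 => "Save as draft"
}

index = build_index(translations)

# A lookup is now a single hash access instead of a scan over every row.
# fetch with a default avoids adding empty entries for unknown words.
ids = index.fetch("save", [])
puts ids.inspect
```

The index is built once (and refreshed when the data changes), so the cost of reading every row is paid at indexing time rather than on every search.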
While Sphinx works great and is really easy to set up thanks to Pat Allan’s awesome documentation, the one thing that didn’t work out of the box was searching for text in foreign characters, which, as you can imagine, is a huge problem for translation software.
Mapping foreign characters
The first problem I ran into was that Sphinx couldn’t find anything in Chinese.
The solution I found was to add a mapping for foreign characters to the Sphinx configuration file.
I ended up making this sphinx.yml file. It supports Chinese, Japanese, Korean and Arabic characters, and many others.
I don’t really understand why this mapping is not the default. If your web application is used by people who write with these kinds of characters, search won’t work out of the box. You can’t really control what your users put in your database, and search is a broken feature if they can’t find what they are looking for.
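To illustrate why the mapping matters, here is a small Ruby sketch. It assumes, as I understand Sphinx’s behaviour, that any character missing from the character table is treated as a word separator. The `tokens` helper and the two character sets are invented for the example and are far simpler than a real Sphinx mapping.

```ruby
# Characters matching `indexable` are kept; everything else is treated as
# a separator. Hypothetical helper, a simplification of a character table.
def tokens(text, indexable)
  text.downcase.chars.map { |c| c =~ indexable ? c : " " }.join.split
end

LATIN_ONLY = /[a-z0-9_]/        # roughly a Latin-only table
WITH_HAN   = /[a-z0-9_\p{Han}]/ # the same table, with Chinese characters added

sample = "保存 file"

# With the Latin-only table the Chinese word disappears from the index.
puts tokens(sample, LATIN_ONLY).inspect

# Once the characters are mapped, the word becomes searchable.
puts tokens(sample, WITH_HAN).inspect
```

With the first table, a search for the Chinese word has nothing to match against, which is the symptom described above.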
Fixing the segmenter
After this, I could search and find strings in foreign characters, but I couldn’t find any strings in Chinese, Japanese or Korean (CJK) unless I typed the entire string exactly.
The problem was that the default segmenter (the segmenter is what defines what counts as a word for Sphinx) didn’t work for these writing systems. In English and many other languages, a word is defined as a string separated by spaces. But there are no spaces to separate words in CJK scripts.
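The difference can be sketched in a few lines of Ruby. This is not Sphinx’s actual segmenter, only an illustration: `whitespace_tokens` splits on spaces, while `unigram_tokens` (a made-up helper) treats every CJK character as its own token, which is roughly the idea behind indexing n-grams of length 1.

```ruby
# Space-based segmentation: fine for English, useless for CJK text.
def whitespace_tokens(text)
  text.split
end

# One token per CJK character; runs of other non-space characters stay together.
# Hypothetical helper for illustration only.
CJK_OR_WORD = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]|[^\s\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]+/

def unigram_tokens(text)
  text.scan(CJK_OR_WORD)
end

sentence = "我喜欢翻译"

# The whole sentence is a single "word", so only an exact match can find it.
puts whitespace_tokens(sentence).inspect

# Each character is indexed on its own, so partial queries can match.
puts unigram_tokens(sentence).inspect
```

With one token per character, a query for part of the sentence can be matched against the indexed characters instead of requiring the full string.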
The solution is to add to Sphinx’s configuration file:
min_prefix_len: 1
min_word_len: 1
ngram_len: 1

I hope someone will find this useful. I also made a ticket to include these changes in Thinking Sphinx.
-
March 5, 2010
Welcome to the (new) company blog
I often want to share something, rant or jot down some ideas about software development, business or internationalisation. I always felt like our product blog wasn’t well suited for this.
Some people just want to hear about what’s new on Web Translate It, and care very little about what’s in our heads or how to scale an API that serves more than 5,000 strings per second at peak time.
Also, we have secret projects, so it’s time to separate the product from the company more clearly.
So this is it. This new blog will collect our thoughts and ideas on business, software development and internationalisation, and the blog you used to know at this address has permanently moved to blog.webtranslateit.com.