Lessons Learned: Keeping a webservice consistent
Here are the slides of the talk I held on the 3rd of February 2011 at the Rug::B Berlin.
ActiveRecord Unit Tests and MySQL default encoding
Recently, I’ve been trying to run the unit tests of ActiveRecord of the Rails 2-3-stable branch with the MySQL adapter – I thought I had actually found a bug in an obscure area of association handling with unsaved records.
There were two problems with getting started: First of all, the tests didn’t start because of an uninitalized constant. That was solved as so many problems nowadays by Stack Overflow:
http://stackoverflow.com/questions/1145318/getting-uninitialized-constant-error-when-trying-to-run-tests
Thanks ! One uninstall of test-unit later, some of the unit tests failed with this error:
1) Error:
test_optionally_validates_length_of_using_within_on_create_utf8(ValidationsTest):
ActiveRecord::StatementInvalid: Mysql::Error: Incorrect string value: '\xE4\xB8\x80\xE4\xBA\x8C...' for column 'title' at row 1: INSERT INTO `topics` (`group`, `title`, `author_name`, `approved`, `parent_title`, `bonus_time`, `replies_count`, `type`, `content`, `parent_id`, `author_email_address`, `written_on`, `last_read`) VALUES(NULL, '一二三四五六七八九', NULL, 1, NULL, NULL, 0, NULL, 'whatever', NULL, 'test@test.com', '2011-01-27 10:41:11', NULL)
./test/cases/../../lib/active_record/connection_adapters/abstract_adapter.rb:227:in `log'
./test/cases/../../lib/active_record/connection_adapters/mysql_adapter.rb:324:in `execute_without_query_record'
./test/cases/helper.rb:36:in `execute'
./test/cases/../../lib/active_record/connection_adapters/abstract/database_statements.rb:259:in `insert_sql'
./test/cases/../../lib/active_record/connection_adapters/mysql_adapter.rb:334:in `insert_sql'
./test/cases/../../lib/active_record/connection_adapters/abstract/database_statements.rb:44:in `insert_without_query_dirty'
./test/cases/../../lib/active_record/connection_adapters/abstract/query_cache.rb:19:in `insert'
./test/cases/../../lib/active_record/base.rb:2961:in `create_without_timestamps'
./test/cases/../../lib/active_record/timestamp.rb:53:in `create_without_callbacks'
./test/cases/../../lib/active_record/callbacks.rb:266:in `create'
./test/cases/../../lib/active_record/base.rb:2927:in `create_or_update_without_callbacks'
./test/cases/../../lib/active_record/callbacks.rb:250:in `create_or_update'
./test/cases/../../lib/active_record/base.rb:2577:in `save_without_validation'
./test/cases/../../lib/active_record/validations.rb:1090:in `save_without_dirty'
./test/cases/../../lib/active_record/dirty.rb:79:in `save_without_transactions'
./test/cases/../../lib/active_record/transactions.rb:229:in `send'
./test/cases/../../lib/active_record/transactions.rb:229:in `with_transaction_returning_status'
./test/cases/../../lib/active_record/connection_adapters/abstract/database_statements.rb:136:in `transaction'
./test/cases/../../lib/active_record/transactions.rb:182:in `transaction'
./test/cases/../../lib/active_record/transactions.rb:228:in `with_transaction_returning_status'
./test/cases/../../lib/active_record/transactions.rb:196:in `save'
./test/cases/../../lib/active_record/transactions.rb:208:in `rollback_active_record_state!'
./test/cases/../../lib/active_record/transactions.rb:196:in `save'
./test/cases/validations_test.rb:1146:in `test_optionally_validates_length_of_using_within_on_create_utf8'
./test/cases/../../lib/active_record/test_case.rb:57:in `with_kcode'
./test/cases/validations_test.rb:1137:in `test_optionally_validates_length_of_using_within_on_create_utf8'
./test/cases/../../../activesupport/lib/active_support/testing/setup_and_teardown.rb:62:in `__send__'
./test/cases/../../../activesupport/lib/active_support/testing/setup_and_teardown.rb:62:in `run'
AR tests assume that the database is set to the utf8 character encoding, and rightly so – I think anybody not using Unicode for everything needs to have very very good reasons for doing so. The problem is, though, that MySQL still ships with latin-1 as the default encoding, and nowhere in the notes on running AR tests was this mentioned. Well, here’s the fix:
mysql> drop database activerecord_unittest; Query OK, 100 rows affected (7.28 sec) mysql> drop database activerecord_unittest2; Query OK, 1 row affected (0.10 sec) mysql> create database activerecord_unittest DEFAULT CHARACTER SET 'utf8' DEFAULT COLLATE 'utf8_general_ci'; Query OK, 1 row affected (0.00 sec) mysql> create database activerecord_unittest2 DEFAULT CHARACTER SET 'utf8' DEFAULT COLLATE 'utf8_general_ci'; Query OK, 1 row affected (0.00 sec)
I also encountered this problem:
1) Error:
test_load_schema(AAACreateTablesTest):
ActiveRecord::StatementInvalid: ./test/cases/../../lib/active_record/connection_adapters/abstract_adapter.rb:227:in `log': Mysql::Error: PROCEDURE topics already exists: CREATE PROCEDURE topics() SQL SECURITY INVOKER
BEGIN
select * from topics limit 1;
END
I guess AR doesn’t clean up the stored procedures after the tests ran – AFAIK, they’re not saved with the tables, so they persist after a test run. Dropping and recreating the DB works, though.
At this point I discovered that our Bug had already been fixed in the 2-3-stable branch, merged that, and went home.
Another job well done !
Von Affen, für Affen
Gerade wiedergefunden: Ein altes Bild aus Studienzeiten, als Firmenlogo für diverse Studienarbeiten:

Rails Windows Installer
Just a quick repost of this article: http://www.rubyinside.com/rails-installer-ruby-and-rails-on-windows-in-a-single-install-4201.html , and the link to the Ruby Windows Installer .
Although I personally don’t develop on Windows anymore, I started like many others with the Ruby One-Click installer, and I was quite thankful at the time that it just worked – let’s hope the rails installer provides the same ease of getting started, only with even more of a complete setup.
War Story: Finding bottlenecks in Ruby code with ruby-prof
For my job at Travel IQ , I’ve recently had the pleasure of trying to find out why our so-called cluster job suddenly started to run extremely slowly.
This job is a rake task that runs each night on our production machines – it aggregates the different hotel descriptions of the hotel providers we’ve got partnerships with (like HRS or booking.com) and tries to find out which different descriptions actually mean the same real hotel.
The algorithms used in the task are, I dare say, pretty sophisticated and do a really good job of sifting through a lot of descriptions. The praise here goes to the original author, Dr. Florian Odronitz – a more in-depth look at the clustering process might be another post some day. Due to the amount of data, the task never really ran fast as such – a couple of hours each night, but that was fast enough.
Recently, however, it would run long enough to be still running when the next job started – 24 hours later. The last job we’ve kept alive to finish ran for three weeks X(
It looked like someone had to do something. Unfortunately, that someone was me. Nobody had touched the code in months, and the number of descriptions didn’t change significantly either (that only really happens when we add or remove a partnership). Looking through the code didn’t produce any immediately obvious bottleneck.
My first guess was ActiveRecord – usually, when massaging data, the slowest parts of the process tend to be doing more queries than necessary (which is one of the downsides of using a high-level ORM system like AR – it’s easy to forget that you’re working with a DB under the object-oriented code).
Watching the queries while the task ran didn’t show any queries during the slow parts of the task, though. Dang.
Time to break out the big guns, then: Profiling the task .
Running that on my local machine produced a nice long and interesting profile that, unfortunately, didn’t look like anything was out of order. Only lots of calls to methods I was expecting to be called, and the most time doing actual work was spent down in Geokit’s distance calculation method. Since closeness to each other is the number one criterion for grouping different hotels together, this is expected – we basically need to get an accurate distance for each description against each other description. We also didn’t update the gem – so I still was at a loss.
A colleague put me to it: I had been comparing apples to oranges – I had run the profile locally, not with the production data. ’Doh !
Very, very carefully enabling profiling on a production server and running it there produced a different picture. Unfortunately, I don’t have the profile saved in a blogpost-friendly form. But the culprit was quite obvious: For about 5000 comparisons between hotels, Array#index was called about 15000 times. And that again called Array#=== 50 million times. Ouch. The whole measured part of the code spent 75% of its time calling that method.
The actual code calling the index method wasn’t even directly connected with the clustering:
on_blacklist = blacklist ? !!blacklist[1].index([cluster_element.send(blacklist[0]), candidate.send(blacklist[0])].sort) : false on_whitelist = whitelist ? !!whitelist[1].index([cluster_element.send(whitelist[0]), candidate.send(whitelist[0])].sort) : false
That might be a little obtuse for the uninitiated. What it basically does is using the current cluster_element (a hotel description) to create a signature (an Array instance), and then using Array#index to see if the element is in the blacklist and/or whitelist array.
The problem here is that the blacklist array is currently about 10000 elements long – and that was exactly what was different between my local machine and the production data. And this was what had changed over time – the black/whitelists are used to manually override clustering results, and our product team had been busy weeding out the last erronous clusters.
This is such a classic mistake – since the above code is in an inner loop of comparisons, the amount of work to be done goes up non-linearly with each added black/whitelist entry.
The solution was also simple. I just changed the black/whitelist arrays to be Hashes, and did the lookup by seeing if the signature is a key in the Hash. Generally, this is a good habit to get into – when you do lookups, use the data structure that was designed for it !
After that, the time used for one cluster work packet dropped from 18 minutes to 30 seconds. Case closed. Write smug blog post.
One final note: People looking at profile data for the first time should not be discouraged at the mass of data a profiling usually provides. While people writing about analyzing profiles usually gloss over it (me included in this post), the process is, at least for me, not very straightforward. It involves a lot of going back and forth, trying to make sense of method called (you often see methods being called you never use directly), and looking up what the terminology used in the reports means. And you need to be quite familiar with the code to have the mental map that lets you see anything that doesn’t fit into it.
Setting the program name with $0
Here’s a quick tip that might be useful: In Ruby, you can set the title of the current Ruby process by assigning a string to $0. At Travel IQ, we’ve found this especially useful for long-running rake tasks that clean up and massage our data at night. Some of those rake tasks were running very long – and we were scratching our heads as to what they might be doing.
So we simply set the program name in every rake task and its dependencies, and the next time around, a look at htop revealed where we’ve been stuck.
One caveat, though: It seems that on Ubuntu machines, the maximum length of the program name is determined when the process is started, i.e. setting it later to something longer truncates the new value to the length of the first: http://www.ruby-forum.com/topic/187083
Hauseingänge
Ein Haiku über den Grund, warum man als Radfahrer nicht auf dem Gehweg an Häusern entlangfährt, auch wenn die Straße noch so knochenzermürbend kopfsteingepflastert ist:
Schnell am Hauseingang vorbei. Kinder im Spielwahn Ärmchen in Speichen.
Das geht nur auf Deutsch
Im Artikel über Salmonellen (nicht fragen warum ich den angeschaut habe) ist ein Bild verlinkt mit der Unterschrift:
Sekundärelektronenmikroskopaufnahme von Salmonellen (rot eingefärbt)
Sekundärelektronenmikroskopaufnahme, was für ein tolles Wort :)
Vorläufiger Abschlußbericht zum Projekt "Rauchen aufgeben"
Ich rauche seit meinem 16. Geburtstag. An diesem Tag hatte mein Vater mir auf mein Bitten hin eine seiner alten Pfeifen, das Zubehör, und etwas Tabak überlassen. Meine Eltern sind (bis heute) beide Raucher (sie Zigaretten, er Zigarillos), und folgten in der Erziehung dem Grundgedanken, daß man gewisse Dinge als schlicht nicht verhindern kann – und dann kann man wenigstens für kontrollierte Umstände sorgen.
Und so rauchte ich ab dann gelegentlich mal eine Pfeife. Das gefiel mir gut und hatte etwas abenteuerliches, vor allem, wenn man beim Angeln, versteckt hinter Schilf, gemütlich vor sich hin paffen konnte.
Auf Pfeife folgten dann alsbald Experimente mit Drehtabak und anderem Rauchgut. Schließlich war ich vollwertiger Nikotinsüchtiger und qualmte so meine Schachtel Nil oder Chesterfield am Tag weg. Und nachts in der Kneipe noch eine. Die Zigarette und ich gehörten zusammen, als Anosmie-Kranker hatte ich kein Problem mit dem Geruch, und in WG und später eigener Wohnung war Rauchen nie ein Problem.
Warum also aufhören ?
Weil es mich langsam, aber sicher, umgebracht hat. Meine Hausärztin hatte mir letztes Jahr während einer Routineuntersuchung bereits eingeschränkte Lungenfunktion bescheinigt, und ich merkte selber, daß ich kurzatmiger wurde. Erkältungen dauerten länger und zogen einen tage- und wochenlangen Hustenschweif hinter sich her.
Manchmal bemerkte ich beim Einschlafen, daß mein Brustkorb leise Pfeifgeräusche beim Atmen machte, wie eine fast verstopfte Nase.
Bei der letzten Erkältung vor ca. drei Wochen bekam ich dann eines Abends tatsächlich richtig schlecht Luft. Das war die Ohrfeige, die dem Faß den Boden ins Gesicht schlug.
Seit drei Wochen bin ich also rauchfrei, nach ca. 13 Jahren mittelmäßig intensivem Konsums. Meine Erkenntnisse zusammengefasst lauten:
- Das Rauchen aufzugeben ist beschissen schwer
- Rauchen ist eine richtige Droge
Erstaunt hat mich, daß ich weniger das Verlangen nach der Zigarette zu bändigen hatte, als vielmehr andere Entzugserscheinungen. Ich hatte Schwindelgefühl, Bauchweh, Depressionsgefühle, und allgemeine Schlaffheit, wie bei einer leichten Erkältung. Teilweise ging das mit einem rauschartigen Gefühl der Realitätsferne einher. Die Körperchemie ist nach vielen Jahren dauernden Zufuhrs mit einem externen Mittel einfach durcheinander, wenn man die Zufuhr dann abstellt.
Das vielbeschworene Geschmacksempfinden, das sich nach der Rauchaufgabe geradezu orgiastisch entfalten soll, ist bisher ausgeblieben. Dagegen hat sich bewahrheitet, daß man mit Alkohol die Lust nach der Zigarette wieder anfachen kann.
Ansonsten bleibt gelegentlich das Gefühl, irgendwie was vergessen zu haben, zum Beispiel nach einem kräftigen Essen. Ich hoffe, das stellt sich langsam ganz ein…
Alles in allem wünschte ich, ich hätte die Geschichte nie angefangen. Aber vielleicht ist es als Erfahrung nicht schlecht – wenn das Aufgeben von Nikotin schon schwer ist, wie fest müssen einen dann andere Drogen umklammern ? Als Präventativ ist es vielleicht nicht ganz negativ, wenn man sich nur leicht die Finger verbrennt, und sich dafür schwerere Verletzungen erspart.
Weird shit with Ruby's float formatting
>> 1.upto 50 do |x| f = ('1.' + ('1' * x) + '5' ); puts "%.#{x}f % #{f}: " + "%.#{x}f" % f.to_f end
%.1f % 1.15: 1.1
%.2f % 1.115: 1.11
%.3f % 1.1115: 1.111
%.4f % 1.11115: 1.1112
%.5f % 1.111115: 1.11112
%.6f % 1.1111115: 1.111112
%.7f % 1.11111115: 1.1111111
%.8f % 1.111111115: 1.11111111
%.9f % 1.1111111115: 1.111111112
%.10f % 1.11111111115: 1.1111111112
%.11f % 1.111111111115: 1.11111111112
%.12f % 1.1111111111115: 1.111111111111
%.13f % 1.11111111111115: 1.1111111111112
%.14f % 1.111111111111115: 1.11111111111111
%.15f % 1.1111111111111115: 1.111111111111112
%.16f % 1.11111111111111115: 1.1111111111111112
%.17f % 1.111111111111111115: 1.11111111111111116
%.18f % 1.1111111111111111115: 1.111111111111111160
%.19f % 1.11111111111111111115: 1.1111111111111111605
%.20f % 1.111111111111111111115: 1.11111111111111116045
%.21f % 1.1111111111111111111115: 1.111111111111111160454
%.22f % 1.11111111111111111111115: 1.1111111111111111604544
%.23f % 1.111111111111111111111115: 1.11111111111111116045436
%.24f % 1.1111111111111111111111115: 1.111111111111111160454357
%.25f % 1.11111111111111111111111115: 1.1111111111111111604543567
%.26f % 1.111111111111111111111111115: 1.11111111111111116045435665
%.27f % 1.1111111111111111111111111115: 1.111111111111111160454356650
%.28f % 1.11111111111111111111111111115: 1.1111111111111111604543566500
%.29f % 1.111111111111111111111111111115: 1.11111111111111116045435665001
%.30f % 1.1111111111111111111111111111115: 1.111111111111111160454356650007
%.31f % 1.11111111111111111111111111111115: 1.1111111111111111604543566500070
%.32f % 1.111111111111111111111111111111115: 1.11111111111111116045435665000696
%.33f % 1.1111111111111111111111111111111115: 1.111111111111111160454356650006957
%.34f % 1.11111111111111111111111111111111115: 1.1111111111111111604543566500069574
%.35f % 1.111111111111111111111111111111111115: 1.11111111111111116045435665000695735
%.36f % 1.1111111111111111111111111111111111115: 1.111111111111111160454356650006957352
%.37f % 1.11111111111111111111111111111111111115: 1.1111111111111111604543566500069573522
%.38f % 1.111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216
%.39f % 1.1111111111111111111111111111111111111115: 1.111111111111111160454356650006957352161
%.40f % 1.11111111111111111111111111111111111111115: 1.1111111111111111604543566500069573521614
%.41f % 1.111111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216141
%.42f % 1.1111111111111111111111111111111111111111115: 1.111111111111111160454356650006957352161407
%.43f % 1.11111111111111111111111111111111111111111115: 1.1111111111111111604543566500069573521614075
%.44f % 1.111111111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216140747
%.45f % 1.1111111111111111111111111111111111111111111115: 1.111111111111111160454356650006957352161407471
%.46f % 1.11111111111111111111111111111111111111111111115: 1.1111111111111111604543566500069573521614074707
%.47f % 1.111111111111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216140747070
%.48f % 1.1111111111111111111111111111111111111111111111115: 1.111111111111111160454356650006957352161407470703
%.49f % 1.11111111111111111111111111111111111111111111111115: 1.1111111111111111604543566500069573521614074707031
%.50f % 1.111111111111111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216140747070312
ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux], MBARI 0×6770, Ruby Enterprise Edition 2009.10
Thanks to Gerriet Riessen: http://gist.github.com/631981
Das Rauchen aufzugeben...
…ist wie verliebt sein: Man versucht ständig, sich abzulenken, und doch durchfährt einen alle paar Minuten aufs Neue schmerzhaft die Erkenntnis, daß da etwas ist, nach dem es einem inbrünstig verlangt, und das man nicht haben kann.
Und dann fühlt man sich, als wäre man erkältet. Naja, ab da hinkt der Vergleich etwas.
Encoding error while parsing JSON
You know, given how much JSON is loved by the Ruby and other communities, I sometimes think the tools are not quite as professional yet as I would like.
Case in point: An error I debugged today for Travel IQ, which occured occasionally in our API (which is public – become an affiliate now and earn us yourself some money !).
That’s a Rails 2.3.5 app, with the json and json_pure gems at 1.4.6 .
The helpful exception was:
JSON::ParserError: 345: unexpected token at '{"id":1920944,"name":"Homestead Seattle-Bellevue","street":"3700 132nd Ave Se","zip":"98006","city":"Bellevue"
...
The rest of the json fills some pages. Thanks, JSON gem, for this helpful message !
JSONLint also gave me this error:
syntax error, unexpected TINVALID at line 2 Parsing failed
At least it reformats the JSON and gave me the line.
With trial and error, I could narrow it down to the hotel description text:
>> puts "Homestead Seattle \x1A Bellevue" Homestead Seattle Bellevue
Alright, that’s an escaped ASCII character, the decimal value is 26, referenced as “Substitute” – one of the nonprintable ASCII characters.
Here is a very nice post explaining (for a different example, the famous non-breaking-space) different ways of escaping.
Back to my problem:
>> JSON.load("Das Homestead Seattle \032 Bellevue befindet sich in Bellevue, Wash.".to_json)
JSON::ParserError: 705: unexpected token at '"Das Homestead Seattle \u001a Bellevue befindet sich in Bellevue, Wash."'
from /usr/local/lib/ruby/gems/1.8/gems/json-1.4.6/lib/json/common.rb:146:in `parse'
from /usr/local/lib/ruby/gems/1.8/gems/json-1.4.6/lib/json/common.rb:146:in `parse'
Now that’s just weak. The to_json method produces JSON that can’t be parsed by JSON.parse ? I think the to_json method originates somewhere in Rails, but still…
Oh well, in this case it worked by replacing the to_json call with JSON.dump .
About autocompleters and page caching
At Travel IQ , we’ve recently had to spend some time revising how our autocompleter for the cities and regions search form work – and came across something that surprised us.
For the impatient, here’s the moral of the story up front: Caching can be quite harmful to performance when done for the wrong thing .
Sounds like an obvious truism (“Something is bad when done wrong”), but let me elaborate:
We’ve always used page caching for the backend part of the autocompleters. The SQL that actually retrieves the results had to join across a couple of tables, needs partial matching of strings, and has an elaborate order clause, so it wasn’t very fast. No biggy: Most of the requests will be the same anyway (big cities like Berlin), so we’ll just cache the results as files on the disk and be done with it – repeated requests will be as fast as can be, as the responses are just files delivered by the webserver straight from the disk. Rails makes this dead easy.
That worked well until we’ve started airing our TV spot for www.hotelauskunft.de . Shortly after each TV spot ran, we’ve noticed that the autocompleter slowed waaay down, in peak times it was basically unusable. Autocompleters need to be fast, otherwise they’re useless.
How could that be ? Wasn’t everything page cached ? Our apache monitoring showed high IO wait percentages – something was using the disk a whole lot.
Armed with an irb console, I poked around the apache access logs, and grepped out all request strings people had entered over the course of two days into the hotel autocompleter.
irb(main):021:0> hits['1'].size => 10175 irb(main):022:0> hits['2'].size => 12972
Hmm, how often do people search for the same thing ?
irb(main):025:0> hits['1'].uniq.size => 5177 irb(main):026:0> hits['2'].uniq.size => 6163
Waitaminute – only about half requests over the course of a day are unique ?
irb(main):030:0> (hits['1'] + hits['2']).size => 23147 irb(main):029:0> (hits['1'] & hits['2']).size => 1455
Oh, and only 1500 of 23000 requests are shared between two days. Damn, it turns out our assumption about people always searching the same thing wasn’t quite right…
Counting and sorting the unique requests revealed that while, yes, many requests are repeated a lot, there is a massively long tail of small towns, regions, and typos that make about 75% of our daily autocompleter requests (Unfortunatley, I didn’t save all the numbers so I can’t show them here).
So that’s why we had high IO and an unresponsive autocompleter – it was trying to write bajillions of new files to the disk, and then they weren’t even used more than once…
Our solution to the problem was first writing a simple script that reproduced the problem we saw during the peak times. All it did was hit the autocompleter URL with random strings – and voilá, it became unusable almost immediatley.
Then we simply removed the page caching and ran the script again – the autocompleter slowed down when a lot of requests were hitting it, but it didn’t become completely unresponsive anymore.
After that, the course was clear. We had another long hard look at the SQL, and managed to remove the joins. The query times dropped to a third of the original value. And now we started to have problems slowing down the autocompleter with the load testing script – it couldn’t make requests fast enough anymore to make a serious dent in the response times.
What I’ve taken away from the whole thing is this: Next time when thinking about page caching something, make damn sure the caches will actually be used by a significant number of requests. Otherwise, you’re just hurting overall performance.
In lieu of an introduction...
…a shameless plug to my flickr photostream at: http://www.flickr.com/photos/mgpalmer

