First Touch Penalty on Amazon EBS Volumes

August 8, 2017

Apparently, storage blocks on Amazon EBS volumes that were restored from snapshots incur in significant latency penalties on I/O operations the first time they’re accessed. According to the user guide:

New EBS volumes receive their maximum performance the moment that they are available and do not require initialization (formerly known as pre-warming). However, storage blocks on volumes that were restored from snapshots must be initialized (pulled down from Amazon S3 and written to the volume) before you can access the block. This preliminary action takes time and can cause a significant increase in the latency of an I/O operation the first time each block is accessed. For most applications, amortizing this cost over the lifetime of the volume is acceptable. Performance is restored after the data is accessed once.

This can be particularly painful when restoring Amazon RDS DB snapshots, as the performance can be severely impacted. In order to overcome the first touch penalty, it is advisable to warm up the disk by performing a full table scan or a vacuum on all tables in the database.

Tracking PostgreSQL Execution Statistics on Amazon RDS

July 21, 2017

I’ve recently been working with PostgreSQL on Amazon RDS and found out about the pg_stat_statements module, which allows tracking execution statistics of all SQL statements executed by the server. It can be very helpful when monitoring your application and discovering about slow running queries. It is also very easy to set up.

Enabling it on Amazon RDS

Setting it up on Amazon RDS is easy, but does require rebooting your database. First, go to your AWS Console and create a new parameter group or modify one of the existing ones. You should find the Parameter Groups option on the sidebar to your left (at least in the current version of the interface). First, you should include pg_stat_statements as a shared library to preload into the server. In order to do so, include the pg_stat_statements string in the shared_preload_libraries parameter. The value of the parameter should be a comma-separated list of libraries to preload. Chances are that the value is currently empty, so you’d only need to set it to pg_stat_statements. Afterwards, you should set the pg_stat_statements.track parameter to ALL, enabling tracking of all queries, even those inside stored procedures. If you use and want to keep track of large-sized queries, you should also increase the track_activity_query_size parameter value. Its value specifies the number of bytes reserved to track the currently executing command for each active session. The default value is 1024 but you might want to increase it.

Once you have a parameter group with the relevant parameters defined, you must modify your database to use the new parameter group (if you edited one the database was already using then that’s not necessary) and reboot your database. Rebooting is necessary because libraries set in the shared_preload_libraries parameter will only be loaded at server start. Once rebooted, connect to your database as an RDS superuser and run the following commands to enable the extension and make sure it was set up properly:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT * FROM pg_stat_statements LIMIT 1;

If it returns one row of query statistics information then it’s set up properly. If not, you might not have restarted your database or have a misconfigured instance.

The pg_stat_statements View

The main thing the module provides is the pg_stat_statements view. The view allows you to query statistics about the execution of SQL statements in the server. The official documentation provides a complete description of all columns the view provides, but the ones I find most useful are:

Name Description
query Text of a representative statement.
calls Number of times executed.
total_time Total time spent in the statement, in milliseconds.
min_time Minimum time spent in the statement, in milliseconds.
max_time Maximum time spent in the statement, in milliseconds.
mean_time Mean time spent in the statement, in milliseconds.
rows Total number of rows retrieved or affected by the statement.

According to the documentation, two plannable queries (SELECT, INSERT, UPDATE and DELETE) will be considered the same if they are semantically equivalent except for the values of literal constants appearing in the query. This means that identical queries will be combined into a single entry in the view.

I like to keep track of queries in which the server is spending most of the time, so I usually run:

SELECT * FROM pg_stat_statements ORDER BY total_time DESC;

Ocasionally, it might be necessary to reset the view. I tend to reset it before running a stress test or otherwise to clean old data. In order to do so, the module provides the pg_stat_statements_reset() function. To discard all statistics gathered so far, simply run:

SELECT pg_stat_statements_reset();

If you want to know more about the module, the official documentation is a good place to start. You should also find instructions there on how to set it up in your own managed instance.

Advent of Code

December 9, 2015

Yesterday, I started solving a series of small programming puzzles in Advent of Code. You get one puzzle per day, as a proper advent calendar. So far, each puzzle is divided in two parts, granting a maximum of two stars per puzzle. The two parts of a puzzle use the same input and the second part is usually a simple variation of the first one. You only have to submit the output to the provided puzzles, so it’s possible to do things by hand, albeit unlikely. The puzzles so far have been simple, either by definition, or by having small input sizes, thus not requiring efficient algorithms to be solved. The tree lights up as you go along, giving it a nice effect:

Advent of
   Code tree with 9 levels lit up.
Advent of Code tree with 9 levels lit up.

I’ve decided to solve them all in Scala. Initially I started in C++, but in day 4’s puzzle it was required to find MD5 hashes and I decided to leverage Java and Scala’s standard library to ease up the process, sticking with it for the rest of the puzzles (so far).

A Beets Plugin for Rateyourmusic

October 28, 2014

Genres are usually neglected in my music library metadata. My primary source for music metadata doesn’t do genres and therefore I tend not to worry too much about them. However, I usually rely on rateyourmusic.com to discover new music, either by peeking at the ratings or by searching through genres of albums that I enjoyed. Rateyourmusic attributes genres to albums based on community voting and keeps a tree of genres whose structure is also voted upon. I tend to find its genre information fairly accurate and more often than not I find myself browsing through the “best” albums of a given genre on the site.

Locally, I use beets to manage my music library. It’s a great piece of software and I’ve been increasingly using its querying capabilities to populate my playlists. Being able to do the same sort of genre-based queries I do on Rateyourmusic locally would therefore be awesome.

Beets has a plugin to fetch genre data from last.fm tags: LastGenre. There are a couple of issues I identify with using last.fm tags as genres:

  1. To be reasonably accurate, one should use the album or track tags. However, those are usually neglected by users, who tend to tag artists mostly.
  2. There’s no hierarchy in the tags (one needs to build it externally). I would like to, for example, be able to search for Ambient and get albums of both Ambient Techno and Dark Ambient.

Taking that into account, I decided to write a plugin to fetch genre information from Rateyourmusic and assign it to albums and items in the beets library: beets-rymgenre. Rateyourmusic doesn’t provide a webservice or API for developers to build upon1 so it’s necessary to scrape for the desired information. Beets is written in Python, and so are its plugins. I have little experience with it and no familiarity with its ecosystem. I ended up using lxml for scraping and requests for the HTTP client, but more lightweight solutions may exist.

I’ve already populated some albums in my library with genre information and I’m happy with the plugin so far. A great plus of having genre information in the library metadata is that now I can ask beets for random albums of a given genre (or set of genres, for that matter) whenever I’m not sure what to listen to next.

  1. There’s actually an open ticket to build a webservice / API for Rateyourmusic open since 2009! 

A New Look

August 6, 2014

I confess that this site has been neglected for far too long. Yesterday I thought that it deserved some attention and tried to improve its design a bit. I like clean, hassle-free designs. The previous looks of this site was my attempt at that. It was rushed though. The fonts were not thoroughly chosen, and it didn’t look good on mobile devices. This time around, I decided to get a stronger inspiration from people that know what they’re doing. I first started to look at themes for Jekyll, as it is the generator that is currently powering this site. Eventually I came around at Poole, a sort of helper for Jekyll. Poole currently supports two official themes: Hyde and Lanyon. I like Hyde’s fonts, but not its sidebar, so I adapted its CSS stylesheet to suit my tastes, using what Poole calls the masthead to have the site name and main navigation links. I’m pretty satisfied with the end result, and it is now finally readable on mobile devices. Let’s hope this provides me with extra motivation to push new content more frequently.