Third and final day with PostgreSQL. Today all kind of text and other fancy searches are looked at. Here the SQL standard is resolutely left behind, as pretty much everything is PostgreSQL specific.
PostgreSQL manages advanced index data structures that allow it to efficiently query data using something better than basic comparisons. It especially shines in its handling of geospatial data (searching by distance is a non-trivial problem), but has many options for text searches as well.
Now, it can be easy to get carried away and try to use PostgreSQL for everything, even when a superior alternative option exists. If I had any need for a full text search, I would also look at Lucene and related options. But PostgreSQL can still provide an easy, integrated solution for many situations.
Indexes are not like pixie dust: you cannot just add some and hope all your performance problems will go away (as I learned the hard way). Benchmarks and the
EXPLAIN command must be used to confirm the improvements brought by any index.
This section is based on the first beta version of the book. I hope they fill clarify this part in the final version.
First, let’s look a bit more at the fist index introduced today:
The book states that is creates an index for pattern matching. Think about it for a few seconds, try to imagine what it would look like… Yes, the solution is not trivial, so being able to index for pattern matching sounds like magic. Or pixie dust. But does it work?
1 2 3 4 5 6
It seems that no, it does not work. The optimizer proposes to use a sequential scan, not an index.
The problem here is that
ILIKE is not really supported by the optimizer. It is a PostgreSQL extension which, while useful, is not that well integrated. The standard (and SQL compliant) way to do a case insensitive search is:
1 2 3 4 5 6
Ok, not quite there yet. But that’s normal. We are not using the column directly, so an index on the original values is not going to work. However PostgreSQL has a very nice feature called functional index: it is possible to create an index on the result of a function. Let’s drop the original index and create one with
lower(title) as the indexed value:
Does it look better now?
1 2 3 4 5 6 7 8
Yes, that’s better. What about
1 2 3 4 5 6
No, it still cannot use the index.
What about the regular expression query?
1 2 3 4 5 6 7
No, it does not like the case insensitive operator. Would a
lower(title) conversion work?
1 2 3 4 5 6 7
Still not. Well, we’re not looking for a particular pattern, but for everything else. A negation is not easy to propagate through the optimizer logic, so it should not be surprising that it still cannot use the index. But this (almost) similar query does:
1 2 3 4 5 6 7 8 9
But the general conclusion and take-home lesson is that adding indexes without checking their impact on queries is more than useless: it adds cost on data creation and update, with no compensation at query time.
The contributed packages shipped with PostgreSQL are documented here.
The Wikipedia page on the topic is already very good.
Stored Procedure based movie recommendation
I’m assuming the name of an actor or movie can have errors (after all, end users are known to make and cause errors). The first step will be to identify whether the name is closer to an actor’s name or a movie’s name. The second step will use existing queries to either propose 5 movies from the same actor if the name was closer to an actor’s name or 5 similar movies (using the cube bounding technique) if the name was closer to a movie’s.
The use of
UNION over several strategies to identify either movies or actors gives a lot of freedom. I use the
levenshtein function to select the best match identified by each strategy.
Also, because the full text search query must have a specific format (for instance, each search term must be separated by
&), I use a few text replacement functions to clean up the input search.
Finally, the return value is a
movies. The meaning is that I will return rows that have the same type as rows from the
movies table. If the query was simple, I could just use the
SELECT statement as the body for the function, but as I have to chose between queries, I use the
RETURN NEXT command instead. Explanations can be found in the fine PostgreSQL manual.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Wrapping Up PostgreSQL
Well, not really wrapping it up. I’ll keep using it, if only for fun. This database is powerful, mature, well documented, and extremely flexible over the domain of relational modeling. It is possible to define user defined types with arbitrary content, and fancy indexing. If you find an obscure academic paper describing a exotic indexing for a new datatype, chances are that PostgreSQL will support it, given enough C programming.
The comments from the book on this database are very fair, and I would strongly recommend anyone to give it a good and honest evaluation.