Hash Crash Course
by Simon CozensNovember 02, 2006
When I teach about hashes, I do what most Perl tutors and tutorials do: I introduce the hash as a "dictionary": a mapping between one thing and another. The classic example, for instance, is to have a set of English words mapped to French words:
%french = (
apple => "pomme",
pear => "poivre",
orange => "Leon Brocard"
);
Yet the more I look at my code--and more often, the more I look at how to tidy up other people's code--I realize that this is perhaps the least common use of a hash. Much more often, I use hashes in particular idioms which have very little in common with this concept of a mapping. It's interesting to consider the ways that programmers actually use hashes in Perl code.
Counting
Many of the uses of hashes are to "answer questions about lists." When you have an array or list of values and you need to ask about its properties, you will often find yourself using a hash. Start simply by counting the number of particular elements in a list. Here's the naïve approach:
my $count = 0;
for (@list) {
$count++ if $_ eq "apple";
}
You can smarten this up with the use of the grep function:
$count = grep $_ eq "apple", @list;
... but when you need the number of pears in the list, then you have to do the same again:
$count_apples = grep $_ eq "apple", @list;
$count_pears = grep $_ eq "pear", @list;
Now there are two passes over the list, and the situation isn't going to get any prettier from here. What you want is basically a histogram of the data, and you can get that with a hash:
my %histogram;
$histogram{$_}++ for @list;
This hash associates each individual item with its count, and it only traverses the list once.
In a recent case, I was looking at a list of tags associated with various photographs. To lay out the data for display, it was useful to know how many different tags there are in the list. I could get that simply from the number of keys in the histogram:
$unique = keys %histogram;
I could also delete the duplicates and end up with a list of the unique tags:
@unique = keys %histogram;
Finally, I could show the five most popular tags from the list:
@popular =
(sort { $histogram{$b} <=> $histogram{$b} } @unique)[0..4];
This sorts the list of unique tags based on their popularity in the original list, and pulls out the top five.
Uniqueness
Another idiom to get the unique elements from a list is a variation on the "counting things" idiom; but this time you don't care how many of each item there is, just that there's at least one. This allows you to say:
for (@list) { $unique{$_} = 1 }
@unique = keys %unique;
which reduces to:
%unique = map { $_ => 1 } @list;
@unique = keys %unique;
You can take this a stage further with a slightly non-standard idiom to do the whole thing in one operation:
@unique = keys %{{ map { $_ => 1 } @list }};
keys requires a hash, so this funny-looking line creates an anonymous hash ({ map ... @list }) and then turns it into a real hash (%{ $hash_ref }) to feed it to keys.
The advantage of this trick is that you can use a variation of it to work with lists which include objects as well as ordinary scalars. Here's an example of the problem:
my @tags;
push @tags, Memories::Tag->retrieve_random for 1..10;
This should get ten random Memories::Tag objects. However, some of them might be the same, and, assuming that you want to see the unique ones:
%unique = map { $_ => 1 } @tags;
@unique = keys %unique;
Unfortunately, if you then try to do anything with the tags, it all goes horribly wrong:
print $unique[0]->name;
# Can't locate object method "test" via package "Memories::Tag(0x1801380)"
What's happened is that hash keys can only be strings; once you put the object into a hash as the key, then it's not an object any more. Perl turns it into a string. There's no (easy) way to get the strings back into an object. What you need is some kind of data structure to map these broken strings into the original objects. Thankfully, there is one in the form of the hash itself! So the code becomes:
%unique = map { $_ => $_ } @tags;
@unique = values %unique;
First, map the object--which Perl will smash into a string--with another copy of the object, which will retain its object-ness. Then instead of getting the strings out of the hash, which are useless to us, you get the objects.
Of course, this doesn't quite work, because even if retrieve_random retrieves the same tag name twice, it will return two different objects with the same data. (That's true unless the underlying database abstraction layer uses caching, which is another good use for a hash.)
The solution is to make the list unique based on a property of the tag, such as its name. This time, the code is:
%unique = map { $_->name => $_ } @tags;
@unique = values %unique;
Okay, now you have a list of objects which you've retrieved randomly, but which have unique names. The only problem is that you retrieved ten of them to start with, and then whittled the list down to get rid of duplicates. What if you actually wanted 10?
The solution is to keep track of ones that you've already seen, so that you don't create any duplicates in the first place. The question "Have I seen this before?" is easy to answer using another hash-based idiom:
my @tags;
my %seen;
while (@tags < 10) {
my $candidate = Memories::Tag->retrieve_random;
next if $seen{$candidate->name}++;
push @tags, $candidate;
}
Once again you're collecting unique values by putting them into a hash. Suppose that the first candidate tag that comes along is japan. At this point the hash is empty, and so $seen{japan} is zero. Because it's zero, the code continues through the loop with next, but still increments the hash value. Now $seen{japan} stands at 1, and the tag goes into the list. If japan comes past again, $seen{japan} will be positive so the next code will activate, and the tag will not go into the list again. (Don't try this if you have fewer than 10 distinct tags, of course!)
|
Related Reading Intermediate Perl |
Pages: 1, 2 |


