Parsing Protein Domains with Perl

by James D. Tisdall
November 16, 2001

The Perl programming language is popular with biologists because of its practicality. In my book, Beginning Perl for Bioinformatics, I demonstrate how many of the things biologists want to write programs for are readily--even enjoyably--accomplished with Perl.

My book teaches biologists how to program in Perl, even if they have never programmed before. This article will use Perl at the level found in the middle-to-late chapters in my book, after some of the basics have been learned. However, this article can be read by biologists who do not (yet) know any programming. They should be able to skim the program code in this article, only reading the comments, to get a general feel for how Perl is used in practical applications, using real biological data.

Biological data on computers tends to be either in structured ASCII flat files--that is to say, in plain-text files--or in relational databases. Both of these data sources are easy to handle with Perl programs. For this article, I will discuss one of the flat-file data sources, the Prosite database, which contains valuable biological information about protein domains. I will demonstrate how to use Perl to extract and use the protein domain information. In Beginning Perl for Bioinformatics I also show how to work with several other similar data sources, including GenBank (Genetic Sequence Data Bank), PDB (Protein Data Bank), BLAST (Basic Local Alignment Search Tool) output files, and REBASE (Restriction Enzyme Database).

What is Prosite?

Prosite stands for "A Dictionary of Protein Sites and Patterns." To learn more about the fascinating biology behind Prosite, visit the Prosite User Manual. Here's an introductory description of Prosite from the user manual:

"Prosite is a method of determining what is the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs."

In some cases, the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, it can be identified by the occurrence in its sequence of a particular cluster of residue types, variously known as a pattern, a motif, a signature, or a fingerprint. These motifs arise because of particular requirements on the structure of specific regions of a protein, which may be important, for example, for their binding properties, or for their enzymatic activity.

Prosite is available as a set of plain-text files that provide the data, plus documentation. The Prosite home page provides a user interface that allows you to query the database and examine the documentation. The database can also be obtained for local installation from the Prosite FTP site. Its use is free of charge for noncommercial users.

There is some fascinating and important biology involved here, and the programs that follow use interesting and useful Perl programming techniques. See the Prosite User Manual for the biology background, and Beginning Perl for Bioinformatics for the programming background. Or just keep reading to get a taste for what is possible when you combine programming skills with biological data.

Prosite Data

The Prosite data can be downloaded to your computer. It comes as an ASCII flat file called prosite.dat, which is more than 4MB in size. A small version of this file created for this article, called prosmall.dat, is available here. This version of the data has just the first few records from the complete file, making it easier for you to download and test, and it's the file that we'll use in the code discussed later in this article.

Prosite also provides an accompanying data file, prosite.doc, which contains documentation for all the records in prosite.dat. Though we will not use it for this article, I do recommend you look at it and think about how to use the information along with the code presented here if you plan on doing more with Prosite.


The Prosite data in prosite.dat (or our much smaller test file prosmall.dat) is organized in "records." Each record consists of several lines and always includes an ID line and a termination line containing "//". Every line begins with a two-character code that specifies the kind of data that appears on that line. Here's a breakdown, from the Prosite User Manual, of all the line types that a record may contain:

ID   Identification (Begins each entry; one per entry)
AC   Accession number (one per entry)
DT   Date (one per entry)
DE   Short description (one per entry)
PA   Pattern (>=0 per entry)
MA   Matrix/profile (>=0 per entry)
RU   Rule (>=0 per entry)
NR   Numerical results (>=0 per entry)
CC   Comments (>=0 per entry)
DR   Cross references to SWISS-PROT (>=0 per entry)
3D   Cross references to PDB (>=0 per entry)
DO   Pointer to the documentation file (one per entry)
//   Termination line (Ends each entry; one per entry)

Each of these line types has certain kinds of information that are formatted in a specific manner, as is detailed in the Prosite documentation.
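To make the record layout concrete, here is a rough sketch of how such records might be split apart in Perl. It is not code from the Prosite distribution or from my book: the subroutine name parse_prosite_records and the two sample records (DEMO_SITE_1, DEMO_SITE_2, and their accession numbers) are made up for illustration, and the sketch assumes only what is described above, namely a two-character code at the start of each line and a "//" line ending each record.

```perl
use strict;
use warnings;

# Two made-up records in the layout described above (hypothetical IDs and
# accession numbers; not copied from prosite.dat or prosmall.dat).
my $data = <<'END';
ID   DEMO_SITE_1; PATTERN.
AC   PS90001;
DE   Made-up entry for illustration only.
PA   [AC]-x-V-x(4)-{ED}.
//
ID   DEMO_SITE_2; PATTERN.
AC   PS90002;
DE   Another made-up entry.
PA   <A-x(2,4)-[ALT].
//
END

# Split flat-file text into records. Each record is a hash that maps a
# two-character line code to the list of data strings seen for that code
# (a list, because several line types may occur more than once).
sub parse_prosite_records {
    my ($text) = @_;
    my @records;
    my $current = {};
    for my $line (split /\n/, $text) {
        if ($line =~ m{^//}) {
            # Termination line: store the finished record, start a new one
            push @records, $current if %$current;
            $current = {};
        }
        elsif ($line =~ /^(\S\S)\s+(.*)$/) {
            # Line code in $1, the rest of the line in $2
            push @{ $current->{$1} }, $2;
        }
    }
    return @records;
}

# Report the ID line and the pattern line(s) of each record
for my $record (parse_prosite_records($data)) {
    my $id      = $record->{ID}[0];
    my $pattern = join '', @{ $record->{PA} || [] };
    print "$id => $pattern\n";
}
```

Keeping every line type in a list, even those that occur once per entry, means the same loop handles all the codes in the table without special cases.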

Prosite Patterns

Let's look specifically at the Prosite patterns. These are presented in a kind of mini-language that describes a set of short stretches of protein that may be a region of known biological activity. Here's the description of the pattern "language" from the Prosite User Manual:

The PA (PAttern) lines contains the definition of a Prosite pattern. The patterns are described using the following conventions:

  • The standard IUPAC one-letter codes for the amino acids are used.
  • The symbol `x' is used for a position where any amino acid is accepted.
  • Ambiguities are indicated by listing the acceptable amino acids for a given position, between square parentheses `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
  • Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
  • Each element in a pattern is separated from its neighbor by a `-'.
  • Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parenthesis. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
  • When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a `<' symbol or respectively ends with a `>' symbol.
  • A period ends the pattern.
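Readers who already know some Perl may find it helpful to see these conventions next to regular expressions written by hand. The sketch below is only an illustration: the pattern fragments, the toy peptides, and the subroutine name count_failures are all made up, and the regular expressions are my own hand translations rather than anything supplied by Prosite.

```perl
use strict;
use warnings;

# Each row: a Prosite fragment, a hand-written Perl equivalent, a toy
# peptide expected to match, and a toy peptide expected not to match.
# All peptides are made up for illustration.
my @examples = (
    [ '[ALT]',      qr/[ALT]/,    'GLG',  'GGG' ],  # one of Ala, Leu, Thr
    [ '{AM}',       qr/[^AM]/,    'AGM',  'AMA' ],  # anything but Ala, Met
    [ 'A-x(2,4)-T', qr/A.{2,4}T/, 'AGGT', 'AGT' ],  # 2 to 4 of any residue
    [ '<A-T',       qr/^AT/,      'ATG',  'GAT' ],  # N-terminal only
    [ 'A-T>',       qr/AT$/,      'GAT',  'ATG' ],  # C-terminal only
);

# Check every row and return the number of rows that misbehave
sub count_failures {
    my $failures = 0;
    for my $example (@examples) {
        my ($prosite, $regexp, $hit, $miss) = @$example;
        my $ok = ($hit =~ $regexp) && ($miss !~ $regexp);
        $failures++ unless $ok;
        printf "%-12s %s\n", $prosite, $ok ? 'ok' : 'FAILED';
    }
    return $failures;
}

count_failures();
```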

Perl Subroutine to Translate Prosite Patterns into Perl Regular Expressions

To use this pattern data in our Perl program, we need to translate the Prosite patterns into Perl regular expressions, which are the main way that you search for patterns in data in Perl. For the sake of this article I will assume that you know the basic regular expression syntax. (If not, just read the program comments, and skip the Perl regular expressions.) As an example of what the following subroutine does, it will translate the Prosite pattern [AC]-x-V-x(4)-{ED}. into the equivalent Perl regular expression [AC].V.{4}[^ED] (the period that ends the Prosite pattern is dropped).

Here, then, is the subroutine PROSITE_2_regexp, which translates Prosite patterns into Perl regular expressions:


#
# Calculate a Perl regular expression
#  from a PROSITE pattern
#
sub PROSITE_2_regexp {

  #
  # Collect the PROSITE pattern
  #
  my($pattern) = @_;

  #
  # Copy the pattern to a regular expression
  #
  my $regexp = $pattern;

  #
  # Now start translating the pattern to an
  #  equivalent regular expression
  #

  #
  # Remove the period at the end of the pattern
  #  (the dot is escaped so that only a literal '.'
  #  is removed, not whatever character comes last)
  #
  $regexp =~ s/\.$//;

  #
  # Replace 'x' with a dot '.'
  #
  $regexp =~ s/x/./g;

  #
  # Leave an ambiguity such as '[ALT]' as is.
  #   However, there are two patterns [G>] that need
  #   special treatment (and the PROSITE documentation
  #   is a bit vague, perhaps).
  #
  $regexp =~ s/\[G\>\]/(G|\$)/;
  
  #
  # Ambiguities such as {AM} translate to [^AM].
  #  (The braces are escaped so they are always
  #  treated as literal characters.)
  #
  $regexp =~ s/\{([A-Z]+)\}/[^$1]/g;

  #
  # Remove the '-' between elements in a pattern
  #
  $regexp =~ s/-//g;

  #
  # Repetitions such as x(3), by now rewritten as .(3),
  #  translate to .{3}
  #
  $regexp =~ s/\((\d+)\)/{$1}/g;

  #
  # Repetitions such as x(2,4), by now rewritten as .(2,4),
  #  translate to .{2,4}
  #
  $regexp =~ s/\((\d+,\d+)\)/{$1}/g;

  #
  # '<' becomes '^' for "beginning of sequence"
  #
  $regexp =~ s/\</^/;

  #
  # '>' becomes '$' for "end of sequence"
  #
  $regexp =~ s/\>/\$/;

  #
  # Return the regular expression
  #
  return $regexp;
}

Subroutine PROSITE_2_regexp takes the Prosite pattern and translates its parts step by step into the equivalent Perl regular expression, as explained in the comments for the subroutine. If you do not know Perl regular expression syntax at this point, just read the comments--that is, the lines that start with the # character. That will give you the general idea of the subroutine, even if you don't know any Perl at all.
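As a hedged illustration of how the subroutine might be put to work, the sketch below translates the example pattern from above and searches a short sequence with the result. So that it runs on its own, it includes a condensed copy of PROSITE_2_regexp with the same steps in the same order. The helper find_motif and the protein sequence are made up for this example.

```perl
use strict;
use warnings;

# Condensed copy of PROSITE_2_regexp (same steps, same order), included
# only so that this example is self-contained.
sub PROSITE_2_regexp {
    my ($pattern) = @_;
    my $regexp = $pattern;
    $regexp =~ s/\.$//;                  # drop the final period
    $regexp =~ s/x/./g;                  # x means any amino acid
    $regexp =~ s/\[G\>\]/(G|\$)/;        # special case [G>]
    $regexp =~ s/\{([A-Z]+)\}/[^$1]/g;   # {AM} becomes [^AM]
    $regexp =~ s/-//g;                   # remove element separators
    $regexp =~ s/\((\d+)\)/{$1}/g;       # (3) becomes {3}
    $regexp =~ s/\((\d+,\d+)\)/{$1}/g;   # (2,4) becomes {2,4}
    $regexp =~ s/\</^/;                  # beginning of sequence
    $regexp =~ s/\>/\$/;                 # end of sequence
    return $regexp;
}

# Hypothetical helper: return the 1-based start position and the matched
# text of the first occurrence, or an empty list if there is no match.
sub find_motif {
    my ($regexp, $protein) = @_;
    if ($protein =~ /$regexp/) {
        # @- and @+ hold the start and end offsets of the last match
        return ($-[0] + 1, substr($protein, $-[0], $+[0] - $-[0]));
    }
    return;
}

my $regexp  = PROSITE_2_regexp('[AC]-x-V-x(4)-{ED}.');
my $protein = 'MKAGVLLLLKDE';    # made-up sequence, not real data

print "Regular expression: $regexp\n";
if (my ($start, $match) = find_motif($regexp, $protein)) {
    print "Found $match at position $start\n";
}
else {
    print "Pattern not found\n";
}
```

Positions are reported counting from 1, as biologists usually number residues, rather than from 0 as Perl numbers string offsets.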






