Functional group description language
Functional groups are described in the file funcgrp.chm. Each entry consists of the following:
1. Group name. This identifier must be unique and cannot be a reserved word, so an atom type cannot be used as a name. The name must begin with a letter or underscore, followed by any number of letters, digits, or underscores.
2. Pattern string. This is described below.
3. Complexity.
For example, the line describing carboxylic acids might look like this:
ACID = "C(=O O([H]) [CH])", 10
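The entry format above can be read with a short script. This is a minimal sketch, not the program's own reader. The function name parse_entry is hypothetical, the RESERVED set is an assumed stand-in for the real reserved-word list (which this section does not give), and the complexity is assumed to be a non-negative integer, as in the example.

```python
import re

# Hypothetical subset of reserved atom types; the real list is not given here.
RESERVED = {"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"}

# NAME = "pattern", complexity
# The name starts with a letter or underscore, then letters, digits, underscores.
ENTRY = re.compile(
    r'^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"([^"]*)"\s*,\s*(\d+)\s*$'
)

def parse_entry(line):
    """Split one funcgrp.chm-style line into (name, pattern, complexity)."""
    m = ENTRY.match(line)
    if m is None:
        raise ValueError(f"malformed entry: {line!r}")
    name, pattern, complexity = m.group(1), m.group(2), int(m.group(3))
    if name in RESERVED:
        raise ValueError(f"group name {name!r} is a reserved word")
    return name, pattern, complexity

name, pattern, complexity = parse_entry('ACID = "C(=O O([H]) [CH])", 10')
```

Uniqueness of names is a property of the whole file, so a caller would check it separately, for instance by collecting the names in a set while reading.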
The pattern string consists of "hard" and "soft" atoms. Hard atoms must be present in the structure; they are numbered automatically from left to right and assigned the default variables A1 ... An. A soft atom stands for one or more atoms, each of which must match one element from a list. In the grammar below, A denotes a hard atom and O denotes a soft atom.
description = "A"
A = S(A... O)
O = [S...]^{n-m} or [S...]^n
S = chemical element symbol (case sensitive).
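The parenthesized neighbor list, the soft atom O that closes it, and the ^ count are optional. Notice that the above definition is recursive: a hard atom A may contain further hard atoms in its neighbor list.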
The pattern must begin with a hard atom, which is called the root atom, optionally followed by a neighbor list in parentheses. The root atom matters because it is the atom used to calculate distances between pairs of functional groups.
The neighbor list is a list of zero or more hard atoms, optionally followed by a soft atom. The soft atom is enclosed by brackets and optionally followed by a number or number range. Double and triple bonds are represented by = or %, respectively.
In the carboxylic acid example, the root atom is the carboxyl carbon (numbered A1). Its neighbor list contains an oxygen joined to the carbon by a double bond (A2), an oxygen joined by a single bond (A3) that in turn carries a soft hydrogen, and finally a soft atom, either carbon or hydrogen, joined to the root atom. Why did we use a soft hydrogen instead of a hard hydrogen? The only difference is that a soft hydrogen is not numbered. Since all hydrogen atoms in our program are implicit, it does not make sense to number them.
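The recursive grammar and the numbering of hard atoms can be illustrated with a small recursive-descent parser. This is a sketch under stated assumptions, not the program's actual parser. The names Hard, Soft, tokenize and parse_pattern are hypothetical, an element symbol is assumed to be one uppercase letter optionally followed by one lowercase letter, and a soft atom without ^ is assumed to stand for exactly one atom.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Soft:
    elements: List[str]          # allowed element symbols
    min_count: int = 1           # assumed default when no ^ is given
    max_count: int = 1

@dataclass
class Hard:
    element: str
    index: int                   # 1 for A1, 2 for A2, ...
    bond: int = 1                # bond order to the parent atom
    neighbors: List["Hard"] = field(default_factory=list)
    soft: Optional[Soft] = None

TOKEN = re.compile(r"\s*([A-Z][a-z]?|\d+|[()\[\]{}^=%-])")

def tokenize(pattern):
    tokens, pos, pattern = [], 0, pattern.rstrip()
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class _Parser:
    def __init__(self, tokens):
        self.tokens, self.pos, self.count = tokens, 0, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def element(self):
        tok = self.take()
        if not tok[0].isupper():
            raise ValueError(f"expected an element symbol, got {tok!r}")
        return tok

    def hard(self, bond=1):
        # A = S(A... O): hard atoms are numbered in left-to-right order.
        self.count += 1
        atom = Hard(self.element(), self.count, bond)
        if self.peek() == "(":
            self.take("(")
            while self.peek() not in (")", "[", None):
                order = 1
                if self.peek() in ("=", "%"):
                    order = 2 if self.take() == "=" else 3
                atom.neighbors.append(self.hard(order))
            if self.peek() == "[":
                atom.soft = self.soft()
            self.take(")")
        return atom

    def soft(self):
        # O = [S...]^{n-m} or [S...]^n
        self.take("[")
        elements = [self.element()]
        while self.peek() != "]":
            elements.append(self.element())
        self.take("]")
        low = high = 1
        if self.peek() == "^":
            self.take("^")
            if self.peek() == "{":
                self.take("{")
                low = int(self.take())
                self.take("-")
                high = int(self.take())
                self.take("}")
            else:
                low = high = int(self.take())
        return Soft(elements, low, high)

def parse_pattern(pattern):
    parser = _Parser(tokenize(pattern))
    root = parser.hard()
    if parser.peek() is not None:
        raise ValueError(f"trailing input: {parser.peek()!r}")
    return root

acid = parse_pattern("C(=O O([H]) [CH])")
```

For the carboxylic acid pattern, acid is the root carbon with index 1, and acid.neighbors holds the two oxygens with indices 2 and 3. The soft atoms receive no index, which mirrors the reason given above for writing the hydrogen as a soft atom.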
Let us look at another example: the alcohol.
ALCOHOL = "C(O([H]) [CH]^3)", 10
Here the carbon is the root atom. Its neighbors are the oxygen atom (which in turn is bonded to hydrogen) and three soft atoms, each of which may be either carbon or hydrogen. Notice that this definition rules out enols, gem-diols, α-halohydrins, and phenols, among others. We decided that this restrictive definition was better from a synthetic point of view. For example, even if an enol may be regarded as a kind of alcohol, its reactions and synthesis are different enough to regard it as an independent functional group.
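To see why the [CH]^3 count excludes enols, consider a toy check written only for the ALCOHOL pattern. It is a sketch, not the program's matching algorithm, which this section does not describe. The molecule representation (element, implicit hydrogen count, bonded neighbors with bond orders) and the function name is_alcohol_root are assumptions made for the illustration.

```python
from typing import Dict, Tuple

# Assumed toy representation:
# atom index -> (element, implicit H count, {neighbor index: bond order})
Atom = Tuple[str, int, Dict[int, int]]

def is_alcohol_root(mol: Dict[int, Atom], idx: int) -> bool:
    """Check atom idx against C(O([H]) [CH]^3), hard-coded for this pattern."""
    element, h_count, bonds = mol[idx]
    if element != "C":
        return False
    for o_idx, order in bonds.items():
        o_element, o_h, _ = mol[o_idx]
        # O([H]): a singly bonded oxygen carrying one hydrogen.
        if o_element != "O" or order != 1 or o_h != 1:
            continue
        # [CH]^3: exactly three further neighbors that are carbon or hydrogen.
        carbons = sum(1 for n in bonds if n != o_idx and mol[n][0] == "C")
        if carbons + h_count == 3:
            return True
    return False

# Ethanol, CH3-CH2-OH
ethanol = {
    0: ("C", 3, {1: 1}),
    1: ("C", 2, {0: 1, 2: 1}),
    2: ("O", 1, {1: 1}),
}
# Vinyl alcohol (an enol), CH2=CH-OH
enol = {
    0: ("C", 2, {1: 2}),
    1: ("C", 1, {0: 2, 2: 1}),
    2: ("O", 1, {1: 1}),
}

print(is_alcohol_root(ethanol, 1))  # True
print(is_alcohol_root(enol, 1))     # False
```

The enol carbon has only two carbon or hydrogen neighbors besides the oxygen, so the count of three cannot be met. The same counting argument excludes gem-diols, α-halohydrins, and phenols.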
Here is the complete list of special characters used in functional group descriptions:
[ ] Encloses soft atoms. Between the brackets there must be a list of one or more chemical elements. The number of soft atoms may be optionally specified using ^n after the closing bracket.
^ Must be followed by a number or a range of numbers. Ranges consist of two numbers separated by a hyphen and enclosed in curly brackets. Numbers may only be used for soft atoms.
{ } Encloses a range.
( ) Encloses the neighbor list of an atom. Only hard atoms may have neighbor lists. There may be parentheses inside parentheses, generating a tree-like structure. The neighbor list can have zero or more hard atoms, optionally followed by one soft atom.
- Separates the lower limit from the upper limit of a range.
= Double bond. Must be immediately before a hard atom.
% Triple bond. Must be immediately before a hard atom.
" " Encloses the pattern string.