SnoPy - Snobol Pattern Matching Extension for Python

    snopy - Snobol Pattern Matching
      Notes and Restrictions
    Introduction
    Installation
    Pattern Matching Tutorial
      Concatenation and Alternation
      Moving the Start Point
      Other Pattern Elements
      Pattern Construction Functions
      Recursive Pattern Matching
      Pattern Assignment Operations
      Deferred Matching
      Deferred Replacement
      Examples of Pattern Matching
      Correspondence with Pattern Matching in SPITBOL
      Tutorial Examples that do not work
  SnoPy Examples
      Contents.py
      Cutting lines at the margin
  Implementation
      Interfaces
      SnoPy.py
      The Swig Interface Specification
      The C Program
      Parse

snopy - Snobol Pattern Matching

snopy provides the capability to use SNOBOL pattern matching in Python programs.

The module defines the following functions and data items:


def createpattern(pattern_string, index): Compiles a pattern and
    stores it in the pattern array with index specified by index.  It
    is the users responsibility to manage the array of patterns which
    he is using. The array may have up to 50 such patterns.

def match(target_string, pattern):
    Determine whether the target string contains a substring defined
    by the pattern.

def matchc(target_string, index):
    Determine whether the target string contains a substring defined
    by the precompiled pattern in the pattern array specified by index.

def match_replace(target_string, pattern, replacement_string):
    Determine whether the target string contains a substring defined
    by the pattern.  If the match is successful then the portion of
    target_string which matches the pattern is replaced by the
    replacement_string.

def matchc_replace(target_string, index, replacement_string):
    Determine whether the target string contains a substring defined
    by the precompiled pattern in the pattern array specified by
    index.  If the match is successful then the portion of
    target_string which matches the pattern is replaced by the
    replacement_string.

See the tutorial section below for a discussion of writing pattern.

createpattern returns a 1 for success and 0 indicates failure. In case of a pattern match success, the calls return a SnoPy object which has the following data elements:


success: 1 for success, 0 for failure.  
no_strings: number of strings returned. 
no_cursor_positions: number of cursor positions returned.


tuple: The date elements returned by the interface program. Always
       present.
	   tuple[0]: success (1 for success; 0 for failure)
	   tuple[1]: A list of the returned strings. The first element is
	             replacement string in the case of a replacement call;
				 "" otherwise.  The other strings are values of n1,
                 n2, etc.
       tuple[2]: A list of cursor values returned from the match.
strings: tuple[1]
cursors: tuple[2]

Notes and Restrictions

  1. Pattern must be string constants. They have to be written in a highly stylized way imitating the syntax of Ada because of my use of the Ada pattern matcher.

    For example, consider the Ada pattern sequence from the tutorial:
    '(' & Len (1) * Char & ')'.

    This pattern would be specified as

    "'(' & Len(1) * Char & ')'".

    You are not allowed to have a pattern such as

    '(' & Len(n) & span(ints) ')'.

    You would have to do something like the following:

    "'(' & Len(%d) & span(%s) ')'" % (n, ints)

  2. Avoid having constructs like

    'abc' & 'def'

    within a pattern Use 'abcdef' instead.
  3. Avoid a pattern which is just a string. snopy can not currently handle it. You are much better to use the Python string functions instead.
  4. The assignment operators '*' and '**' can be followed by the name of any Ada string variable. In snopy you are restricted to using one of the five names n1, n2, n3, n4, or n5. After the match, the variables are returned in tuple[1]. For instance, if an assignment is made to n3, the result will be returned in strings[3]. Similarly, if the match is a replacement match the resulting string will be returned in strings[0]. Strings are the same as tuple[1].

    The name 'output' may be used as the target of the assignment operators, in which case the assignment is written as a line of output. See the example Cutting lines at the margin.

  5. Returned strings are limited to 1024 characters.
  6. The cursor assignment operator is restricted to five special variables. One may specify x1, x2, x3, x4, or x5. After the match, the values are returned in tuple[2] which is the same as cursors. The first element in tuple[2] which is a list is always 0. x1 will be found in cursors[1], x2 in cursors[2], etc.
  7. As will be mentioned several time, recursive pattern matching is not supported.

Introduction

I have long thought that SNOBOL pattern matching is superion to the use of regular expressions. This is a description of SnoPy which is a Python extension which enables the python user to utilize a somewhat restricted version of SNOBOL pattern matching.

Knowledge of SNOBOL really dates me. I am proud to say that for a period of nearly 25 years, hardly a week went past that I did not write some code in SNOBOL. About two years ago, I adopted Python as my favorite language but I really never liked using regular expressions. Could never seem to get the number of back slashes right. Here is my attempt to use SNOBOL patterns in Python. I guess that I am in the category of having been spoiled for life by an exposure to SNOBOL

Since SNOBOL is unknown to most people, I want to include a few references. Looking in Google I came across the following references to SNOBOL: The SNOBOL and SPITBOL Page, SNOBOL4 and SPITBOL Information, The SNOBOL Programming Language, and Phil's SNOBOL Resources Page.

I am perfectly aware that there are few Python people chomping at the bit to get their hands on Python support. But I think that SNOBOL was a great piece of nearly lost work and I would like to raise a few eyebrows. I also am aware that this is not a polished piece of work. Part of the reason is that while I used SNOBOL for many years before moving onto Python, I tended to use rather simple pattern matching. I have always shied away from creating programming exercises of the ilk "See what I can do in just one line of X" for any value of X. For a long time my motto has been "God, save us from the clever programmer."

In what follows you will find the term SNOBOL and spitbol. SNOBOL is the language and Spitbol is a compiler for the language. So think of them as being the same for these purposes.

SnoPy was built using SWIG, flex and bison, and an implementation of SNOBOL pattern matching in Ada.

The interface includes three functions.

The pattern match calls return a three element tuple:
(success, string_list, cursor_list)
where:


Installation

At the present time this package only supports Linux. To install the package the following packages are needed.

Basically the steps are:
  1. Read the tutorial to be sure you want to do this at all.
  2. Unzip the distribution file.
  3. Type make.
  4. Run the tests.
    python SnoPy.py
    python test.py
  5. You can experiment in the distribution directory.
  6. Move the shared library, snobolmodule.so, and SnoPy.py to the appropriate place in the python library. For me that is /usr/local/lib/python2.2/site-packages.

When I work out how to generate a dll from the gnat/gcc combination, I will put out a Win32 package. Would appreciate help in doing that.

Pattern Matching Tutorial

This is a copy of a comment from 'g-spipat.ads' in GNAT.SPITBOL.PATTERNS, Revision: 1.17. It is copied in its entirety but has been html'ized to allow hyperlinks. Be aware that there are some features described that are not supported in SnoPy. Those features include recursive pattern matching and most stuff that require functions which return patterns. Another way of expressing it is to say that anything requiring a data type of pattern in Python is not supported. In fact most of the examples in the tutorial do not work because they involve pattern variables or functions. I have left them in to show the reader some of the things that can be done. Let me also say that in the time that I have been using SNOBOL pattern matching, I have not made much use of recursive pattern matching or created many pattern functions.

A pattern matching operation (a call to one of the Match subprograms) takes a subject string and a pattern, and optionally a replacement string. The replacement string option is only allowed if the subject is a variable.

The pattern is matched against the subject string, and either the match fails, or it succeeds matching a contiguous substring. If a replacement string is specified, then the subject string is modified by replacing the matched substring with the given replacement.

Concatenation and Alternation

A pattern consists of a series of pattern elements. The pattern is built up using either the concatenation operator:

   A & B
which means match A followed immediately by matching B, or the alternation operator:
   A or B
which means first attempt to match A, and then if that does not succeed, match B.

There is full backtracking, which means that if a given pattern element fails to match, then previous alternatives are matched. For example if we have the pattern:

  (A or B) & (C or D) & (E or F)

First we attempt to match A, if that succeeds, then we go on to try to match C, and if that succeeds, we go on to try to match E. If E fails, then we try F. If F fails, then we go back and try matching D instead of C. Let's make this explicit using a specific example, and introducing the simplest kind of pattern element, which is a literal string. The meaning of this pattern element is simply to match the characters that correspond to the string characters. Now let's rewrite the above pattern form with specific string literals as the pattern elements:


  ("ABC" or "AB") & ("DEF" or "CDE") & ("GH" or "IJ")

The following strings will be attempted in sequence:

   ABC . DEF . GH
   ABC . DEF . IJ
   ABC . CDE . GH
   ABC . CDE . IJ
   AB . DEF . GH
   AB . DEF . IJ
   AB . CDE . GH
   AB . CDE . IJ
Here we use the dot simply to separate the pieces of the string matched by the three separate elements.

Moving the Start Point

A pattern is not required to match starting at the first character of the string, and is not required to match to the end of the string. The first attempt does indeed attempt to match starting at the first character of the string, trying all the possible alternatives. But if all alternatives fail, then the starting point of the match is moved one character, and all possible alternatives are attempted at the new anchor point.

The entire match fails only when every possible starting point has been attempted. As an example, suppose that we had the subject string

  "ABABCDEIJKL"
matched using the pattern in the previous example:

  ("ABC" or "AB") & ("DEF" or "CDE") & ("GH" or "IJ")

would succeed, after two anchor point moves:
  "ABABCDEIJKL"
     ^^^^^^^
     matched
     section

This mode of pattern matching is called the unanchored mode. It is also possible to put the pattern matcher into anchored mode by setting the global variable Anchored_Mode to True. This will cause all subsequent matches to be performed in anchored mode, where the match is required to start at the first character.

We will also see later how the effect of an anchored match can be obtained for a single specified anchor point if this is desired.

Other Pattern Elements

In addition to strings (or single characters), there are many special pattern elements that correspond to special predefined alternations:

  Arb       Matches any string. First it matches the null string, and
            then on a subsequent failure, matches one character, and
            then two characters, and so on. It only fails if the
            entire remaining string is matched.

  Bal       Matches a non-empty string that is parentheses balanced
            with respect to ordinary () characters. Examples of
            balanced strings are "ABC", "A((B)C)", and "A(B)C(D)E".
            Bal matches the shortest possible balanced string on the
            first attempt, and if there is a subsequent failure,
            attempts to extend the string.

  Cancel    Immediately aborts the entire pattern match, signalling
            failure. This is a specialized pattern element, which is
            useful in conjunction with some of the special pattern
            elements that have side effects.

  Fail      The null alternation. Matches no possible strings, so it
            always signals failure. This is a specialized pattern
            element, which is useful in conjunction with some of the
            special pattern elements that have side effects.

  Fence     Matches the null string at first, and then if a failure
            causes alternatives to be sought, aborts the match (like
            a Cancel). Note that using Fence at the start of a pattern
            has the same effect as matching in anchored mode.

  Rest      Matches from the current point to the last character in
            the string. This is a specialized pattern element, which
            is useful in conjunction with some of the special pattern
            elements that have side effects.

  Succeed   Repeatedly matches the null string (it is equivalent to
            the alternation ("" or "" or "" ....). This is a special
            pattern element, which is useful in conjunction with some
            of the special pattern elements that have side effects.

Pattern Construction Functions

The following functions construct additional pattern elements

  Any(S)    Where S is a string, matches a single character that is
            any one of the characters in S. Fails if the current
            character is not one of the given set of characters.

  Arbno(P)  Where P is any pattern, matches any number of instances
            of the pattern, starting with zero occurrences. It is
            thus equivalent to ("" or (P & ("" or (P & ("" ....)))).
            The pattern P may contain any number of pattern elements
            including the use of alternatiion and concatenation.

  Break(S)  Where S is a string, matches a string of zero or more
            characters up to but not including a break character
            that is one of the characters given in the string S.
            Can match the null string, but cannot match the last
            character in the string, since a break character is
            required to be present.

  BreakX(S) Where S is a string, behaves exactly like Break(S) when
            it first matches, but if a string is successfully matched,
            then a susequent failure causes an attempt to extend the
            matched string.

  Fence(P)  Where P is a pattern, attempts to match the pattern P
            including trying all possible alternatives of P. If none
            of these alternatives succeeds, then the Fence pattern
            fails. If one alternative succeeds, then the pattern
            match proceeds, but on a subsequent failure, no attempt
            is made to search for alternative matches of P. The
            pattern P may contain any number of pattern elements
            including the use of alternatiion and concatenation.

  Len(N)    Where N is a natural number, matches the given number of
            characters. For example, Len(10) matches any string that
            is exactly ten characters long.

  NotAny(S) Where S is a string, matches a single character that is
            not one of the characters of S. Fails if the current
            characer is one of the given set of characters.

  NSpan(S)  Where S is a string, matches a string of zero or more
            characters that is among the characters given in the
            string. Always matches the longest possible such string.
            Always succeeds, since it can match the null string.

  Pos(N)    Where N is a natural number, matches the null string
            if exactly N characters have been matched so far, and
            otherwise fails.

  Rpos(N)   Where N is a natural number, matches the null string
            if exactly N characters remain to be matched, and
            otherwise fails.

  Rtab(N)   Where N is a natural number, matches characters from
            the current position until exactly N characters remain
            to be matched in the string. Fails if fewer than N
            unmatched characters remain in the string.

  Tab(N)    Where N is a natural number, matches characters from
            the current position until exactly N characters have
            been matched in all. Fails if more than N characters
            have already been matched.

  Span(S)   Where S is a string, matches a string of one or more
            characters that is among the characters given in the
            string. Always matches the longest possible such string.
            Fails if the current character is not one of the given
            set of characters.

Recursive Pattern Matching

The plus operator (+P) where P is a pattern variable, creates a recursive pattern that will, at pattern matching time, follow the pointer to obtain the referenced pattern, and then match this pattern. This may be used to construct recursive patterns. Consider for example:

   P := ("A" or ("B" & (+P)))

On the first attempt, this pattern attempts to match the string "A". If this fails, then the alternative matches a "B", followed by an attempt to match P again. This second attempt first attempts to match "A", and so on. The result is a pattern that will match a string of B's followed by a single A.

This particular example could simply be written as NSpan('B') & 'A', but the use of recursive patterns in the general case can construct complex patterns which could not otherwise be built.

Pattern Assignment Operations

In addition to the overall result of a pattern match, which indicates success or failure, it is often useful to be able to keep track of the pieces of the subject string that are matched by individual pattern elements, or subsections of the pattern.

The pattern assignment operators allow this capability. The first form is the immediate assignment:

   P * S
Here P is an arbitrary pattern, and S is a variable of type VString that will be set to the substring matched by P. This assignment happens during pattern matching, so if P matches more than once, then the assignment happens more than once.

The deferred assignment operation:

  P ** S
avoids these multiple assignments by deferring the assignment to the end of the match. If the entire match is successful, and if the pattern P was part of the successful match, then at the end of the matching operation the assignment to S of the string matching P is performed.

The cursor assignment operation:

  Setcur(N'Access)
assigns the current cursor position to the natural variable N. The cursor position is defined as the count of characters that have been matched so far (including any start point moves). Finally the operations * and ** may be used with values of type Text_IO.File_Access. The effect is to do a Put_Line operation of the matched substring. These are particularly useful in debugging pattern matches.

Deferred Matching

The pattern construction functions (such as Len and Any) all permit the use of pointers to natural or string values, or functions that return natural or string values. These forms cause the actual value to be obtained at pattern matching time. This allows interesting possibilities for constructing dynamic patterns as illustrated in the examples section.

In addition the (+S) operator may be used where S is a pointer to string or function returning string, with a similar deferred effect.

A special use of deferred matching is the construction of predicate functions. The element (+P) where P is an access to a function that returns a Boolean value, causes the function to be called at the time the element is matched. If the function returns True, then the null string is matched, if the function returns False, then failure is signalled and previous alternatives are sought.

Deferred Replacement

The simple model given for pattern replacement (where the matched substring is replaced by the string given as the third argument to Match) works fine in simple cases, but this approach does not work in the case where the expression used as the replacement string is dependent on values set by the match.

For example, suppose we want to find an instance of a parenthesized character, and replace the parentheses with square brackets. At first glance it would seem that: Match (Subject, '(' & Len (1) * Char & ')', '[' & Char & ']'); would do the trick, but that does not work, because the third argument to Match gets evaluated too early, before the call to Match, and before the pattern match has had a chance to set Char.

To solve this problem we provide the deferred replacement capability. With this approach, which of course is only needed if the pattern involved has side effects, is to do the match in two stages. The call to Match sets a pattern result in a variable of the private type Match_Result, and then a subsequent Replace operation uses this Match_Result object to perform the required replacement.

Using this approach, we can now write the above operation properly in a manner that will work:

  M : Match_Result;
  ...
  Match (Subject, '(' & Len (1) * Char & ')', M);
  Replace (M, '[' & Char & ']');
As with other Match cases, there is a function and procedure form of this match call. A call to Replace after a failed match has no effect. Note that Subject should not be modified between the calls.

Examples of Pattern Matching

First a simple example of the use of pattern replacement to remove a line number from the start of a string. We assume that the line number has the form of a string of decimal digits followed by a period, followed by one or more spaces.

   Digs : constant Pattern := Span("0123456789");

   Lnum : constant Pattern := Pos(0) & Digs & '.' & Span(' ');
Now to use this pattern we simply do a match with a replacement:
   Match (Line, Lnum, "");
which replaces the line number by the null string. Note that it is also possible to use an Ada.Strings.Maps.Character_Set value as an argument to Span and similar functions, and in particular all the useful constants 'in Ada.Strings.Maps.Constants are available. This means that we could define Digs as:
   Digs : constant Pattern := Span(Decimal_Digit_Set);

The style we use here, of defining constant patterns and then using them is typical. It is possible to build up patterns dynamically, but it is usually more efficient to build them in pieces in advance using constant declarations. Note in particular that although it is possible to construct a pattern directly as an argument for the Match routine, it is much more efficient to preconstruct the pattern as we did in this example.

Now let's look at the use of pattern assignment to break a string into sections. Suppose that the input string has two unsigned decimal integers, separated by spaces or a comma, with spaces allowed anywhere. Then we can isolate the two numbers with the following pattern:

   Num1, Num2 : aliased VString;

   B : constant Pattern := NSpan(' ');

   N : constant Pattern := Span("0123456789");

   T : constant Pattern :=
         NSpan(' ') & N * Num1 & Span(" ,") & N * Num2;

The match operation Match (" 124, 257 ", T) would assign the string 124 to Num1 and the string 257 to Num2.

Now let's see how more complex elements can be built from the set of primitive elements. The following pattern matches strings that have the syntax of Ada 95 based literals:

   Digs  : constant Pattern := Span(Decimal_Digit_Set);
   UDigs : constant Pattern := Digs & Arbno('_' & Digs);

   Edig  : constant Pattern := Span(Hexadecimal_Digit_Set);
   UEdig : constant Pattern := Edig & Arbno('_' & Edig);

   Bnum  : constant Pattern := Udigs & '#' & UEdig & '#';

A match against Bnum will now match the desired strings, e.g. it will match 16#123_abc#, but not a#b#. However, this pattern is not quite complete, since it does not allow colons to replace the pound signs. The following is more complete:

   Bchar : constant Pattern := Any("#:");
   Bnum  : constant Pattern := Udigs & Bchar & UEdig & Bchar;

but that is still not quite right, since it allows # and : to be mixed, and they are supposed to be used consistently. We solve this by using a deferred match.

   Temp  : aliased VString;

   Bnum  : constant Pattern :=
             Udigs & Bchar * Temp & UEdig & (+Temp)

Here the first instance of the base character is stored in Temp, and then later in the pattern we rematch the value that was assigned.

For an example of a recursive pattern, let's define a pattern that is like the built in Bal, but the string matched is balanced with respect to square brackets or curly brackets.

The language for such strings might be defined in extended BNF as

  ELEMENT ::= <any character other than [] or {}>
              | '[' BALANCED_STRING ']'
              | '{' BALANCED_STRING '}'

  BALANCED_STRING ::= ELEMENT {ELEMENT}

Here we use {} to indicate zero or more occurrences of a term, as is common practice in extended BNF. Now we can translate the above BNF into recursive patterns as follows:

  Element, Balanced_String : aliased Pattern;
  .
  .
  .
  Element := NotAny ("[]{}")
               or
             ('[' & (+Balanced_String) & ']')
               or
             ('{' & (+Balanced_String) & '}');

  Balanced_String := Element & Arbno (Element);

Note the important use of + here to refer to a pattern not yet defined. Note also that we use assignments precisely because we cannot refer to as yet undeclared variables in initializations.

Now that this pattern is constructed, we can use it as though it were a new primitive pattern element, and for example, the match:

  Match ("xy[ab{cd}]", Balanced_String * Current_Output & Fail);

will generate the output:

   x
   xy
   xy[ab{cd}]
   y
   y[ab{cd}]
   [ab{cd}]
   a
   ab
   ab{cd}
   b
   b{cd}
   {cd}
   c
   cd
   d

Note that the function of the fail here is simply to force the pattern Balanced_String to match all possible alternatives. Studying the operation of this pattern in detail is highly instructive

.

Finally we give a rather elaborate example of the use of deferred matching. The following declarations build up a pattern which will find the longest string of decimal digits in the subject string.

   Max, Cur : VString;
   Loc      : Natural;

   function GtS return Boolean is
   begin
      return Length (Cur) > Length (Max);
   end GtS;

   Digit : constant Character_Set := Decimal_Digit_Set;

   Digs  : constant Pattern := Span(Digit);

   Find : constant Pattern :=
     "" * Max & Fence            & -- initialize Max to null
     BreakX (Digit)              & -- scan looking for digits
     ((Span(Digit) * Cur         & -- assign next string to Cur
      (+GtS'Unrestricted_Access) & -- check size(Cur) > Size(Max)
      Setcur(Loc'Access))          -- if so, save location
               * Max)            & -- and assign to Max
     Fail;                         -- seek all alternatives

As we see from the comments here, complex patterns like this take on aspects of sequential programs. In fact they are sequential programs with general backtracking. In this pattern, we first use a pattern assignment that matches null and assigns it to Max, so that it is initialized for the new match. Now BreakX scans to the next digit. Arb would do here, but BreakX will be more efficient. Once we have found a digit, we scan out the longest string of digits with Span, and assign it to Cur. The deferred call to GtS tests if the string we assigned to Cur is the longest so far. If not, then failure is signalled, and we seek alternatives (this means that BreakX will extend and look for the next digit string). If the call to GtS succeeds then the matched string is assigned as the largest string so far into Max and its location is saved in Loc. Finally Fail forces the match to fail and seek alternatives, so that the entire string is searched.

If the pattern Find is matched against a string, the variable Max at the end of the pattern will have the longest string of digits, and Loc will be the starting character location of the string. For example, Match("ab123cd4657ef23", Find) will assign "4657" to Max and 11 to Loc (indicating that the string ends with the eleventh character of the string).

Note: the use of Unrestricted_Access to reference GtS will not be needed if GtS is defined at the outer level, but definitely will be necessary if GtS is a nested function (in which case of course the scope of the pattern Find will be restricted to this nested scope, and this cannot be checked, i.e. use of the pattern outside this scope is erroneous). Generally it is a good idea to define patterns and the functions they call at the outer level where possible, to avoid such problems.

Correspondence with Pattern Matching in SPITBOL

Generally the Ada syntax and names correspond closely to SPITBOL syntax for pattern matching construction.

The basic pattern construction operators are renamed as follows:

     Spitbol     Ada

      (space)      &
         |         or
         $         *
         .         **

The Ada operators were chosen so that the relative precedences of these operators corresponds to that of the Spitbol operators, but as always, the use of parentheses is advisable to clarify.

The pattern construction operators all have similar names except for

      Spitbol      Ada

      Abort        Cancel
      Rem          Rest

where we have clashes with Ada reserved names.

Ada requires the use of 'Access to refer to functions used in the pattern match, and often the use of 'Unrestricted_Access may be necessary to get around the scope restrictions if the functions are not declared at the outer level.

The actual pattern matching syntax is modified in Ada as follows:

      Spitbol      Ada

      X Y          Match (X, Y);
      X Y = Z      Match (X, Y, Z);

and pattern failure is indicated by returning a Boolean result from the Match function (True for success, False for failure).

Tutorial Examples that do not work

The first example in the example section will not work because it requires that Digs and Lnum have the data type of pattern. However, Lnum could be written as a constant string:
Lnum = "Pos(0) & Span('1234567890') & '.' & Span(' ')"
Notice the implied rules. Patterns are strings enclosed by double quotes and cannot contain variables. Another way to rewrite Lnum:
Digs = "Span('1234567890')"
Lnum = "Pos(0) & %s & '.' & Span(' ')" % Digs

The final extensive example does not work at all because it relies on defining and calling the function 'Gts'.

SnoPy Examples

The examples in this section are meant to show some specific examples which I have used and may be instructive. Additional examples can be seen in the test program test.py and SnoPy.py which also has some test cases.

Contents.py

This is a small python program which I wrote to scan an html file and to build at the top of the file a table of contents based on the heading entries included in the file. It makes use of one pattern to locate the heading entries and return the level of the heading, the displayed portion of the heading, and the whole entry. The pattern is compiled before entering the main loop of the program. There is a single call the match function in the loop. The real program is in the distribution directory. And yes I did write and use it to generate the table of contents for this web page. It may be of interest to note that there are a couple of place where I use string.find and string.replace in place of SNOBOL pattern matching. Use the right tool in the right place.

#! /usr/bin/env python

# Program to generate a table of contents for an html file
# It works by finding and modifying heading statements in the
# file.  It is a usage example of the SnoPy package.

# Invocation:
#
#       contents.py file_name
#
# generates contenst.html
#
# Operation is to read the html file and rename it as a backup file.
# The file is then scanned while (1) dropping any table of contents stuff
# between the H1 heading and the first H2 heading, and (2) finding all
# subsequent Headings.  Each heading is augmented with an anchor and a
# link to that heading is created.
#
# All lines of the html file prior to the H1 heading are immediately written
# out to a new file with the given file name as are the anchor links as they
# are generated.  The writing of the subsequent lines are deffered until all
# lines of the file have been processed.

import sys
import copy
import os
import time
import SnoPy
import pprint

pp = pprint.PrettyPrinter()
html_lines = []


def w(file, string, deferred_out):
    if deferred_out:
        html_lines.append(string)
    else:
        file.write(string + "\n")

def print_deferred(file):
    ll = len(html_lines)
    for i in range(ll):
        w(file, html_lines[i], 0)

def main():
    global input
    if len(sys.argv) != 2:  # Check arguments
        print "No file name given."
        sys.exit()
    file_name = sys.argv[1]
    file_name = 'explain'
    start_time = time.time()
    if file_name[-1] == ".":
        file_name = file_name[0:-1]
    input  = open(file_name + ".html", 'r')
    lines = input.readlines()
    # File has been read, now rename it.
    os.rename(file_name+'.html', file_name+'.html.BAK')
    out_new = open(file_name + '.html', 'w')

    ret = SnoPy.createpattern(
        "('<h'&any('234')**N2&'>'&arb**N1&'</h'&any('234')&'>')**N3", 1)
    # The above pattern means:
    # Look for a substring starting with '<h' followed by a 2 or a 3 or
    # a 4 followed by '>' follow by an arbitrary string followed by </h followed
    # by a 2 or a 3 or a 4 followed by >.

    # N1 contains the displayed string or 'Arbitrary String Of Char'
    # N2 contains the level or in this example '2'
    # N3 is the whole matching substring.
    # The whole program is built around this pattern.

    deferred_out = 0
    skip_table_contents = 0
    for line in lines:
        s_line = line.rstrip()
        m = SnoPy.matchc(s_line, 1)
        if m:
            # Found a heading of H2 or higher
            deferred_out = 1
            # Replace all blanks in N1
            id = copy.deepcopy(m.strings[1])
            id = id.replace(" ","_")
            nl = '<a name="%s"></a>%s' % (id, m.strings[3])
            dup = int(m.strings[2])
            # What the link to the anchor will look like.
            cl = '%s<a href="#%s">%s</a><br>' % (' '*(dup-1), id,
                                                m.strings[1])
            w(out_new, cl, 0)
            w(out_new, nl, deferred_out)
            if dup < 1:
                skip_table_contents = 0
        else:
            if skip_table_contents == 0:
                w(out_new, s_line, deferred_out)
            if s_line.find('<h1>') > -1:
                skip_table_contents = 1

    #pp.pprint(html_lines)
    print_deferred(out_new)
    end_time = time.time()
    print 'elapsed time = %s' % (end_time - start_time)

if __name__ == '__main__':
    main()

Cutting lines at the margin

Often when printing out long strings, it is desirable to split the line at the last blank prior to the right margin. This little Python function will do that. In the distribution directory, there is a more general version.


import SnoPy

s = "The pattern is matched against the subject string, and either the " + \
   "match fails, or it succeeds matching a contiguous substring. If a " + \
   "replacement string is specified, then the subject string is modified "\
   + "by replacing the matched substring with the given replacement."


def cut_at_margin(s,margin):
    print_patt = ("pos(0)&((pos(%s)&' ') or (breakx(' ')&len(1)))" +\
               "**output&(arbno(notany(' '))&pos(%s)&rest)**n1") % \
               (margin, margin)
    leading_blank_patt = "pos(0)&' '&span(' ')"
    SnoPy.createpattern(print_patt, 1)
    SnoPy.createpattern(leading_blank_patt, 2)
    x = []
    m = SnoPy.matchc(s, 1)
    while m:
        ss = m.strings[1]
        # Chuck any leading blanks.
        n = SnoPy.matchc_replace(s, 2, "")
        if n:
            ss = m.strings[0]
        m = SnoPy.matchc(ss, 1)
    if ss != '': print ss

if __name__ == '__main__':
    cut_at_margin(s, 40)

Implementation

Perhaps the most important decision was to decide which syntax to use. There are three different syntaxes involved. First, there was the original SNOBOL syntax for specifying patterns. The Ada folks found it necessary to change the syntax to use Ada operators for concatenation, alternation, etc. And finally, Python uses still different notation. I decided to use the Ada syntax primarily because of the excellent tutorial included above. In addition, all of the necessary compromises have been well thought through.

Interfaces

SnoPy is implemented by routines written in three languages, Python, C, and Ada. That means that there are two interfaces, between Python and C, and between C and Ada. The first interface is handled by SWIG, which I consider to be one of the really great programming tools that I have encountered. The interface between C and Ada is accomplished directly by gnat, the Ada compiler built on gcc. Gnat provides a package, 'Interfaces.C'. That provides conversions between various C data types and corresponding Ada data types.

The SnoPy interface contains the following calls:


def createpattern(pattern_string, index):
def match(target_string, pattern):
def matchc(target_string, index):
def match_replace(target_string, pattern, replacement_string):
def matchc_replace(target_string, index, replacement_string):

The Swig interface is defined by the file 'snobol.i' This interface supports the following calls to C.


#define L 10
#define M 5
struct ret_s {
  int success;
  int no_strings;
  int no_cur;
  int replacement;
  struct {
    char s[1024];
  } sub[10];
  int cur[6];
};

int createpattern(int index, char * pattern);
struct ret_s * match(char * target, char * pattern);
struct ret_s * matchi(char * target, int patt_index);
struct ret_s * matchr(char * target, char * pattern, char * replacement);
struct ret_s * matchri(char * target, int patt_index, char * replacement);

SnoPy.py

This is a very simple program the main function of which is to create and return the SnoPy object when a match is successful or an indication of success or failure for the createpattern call. Included are some tests.

The Swig Interface Specification

This is not too bad. The tricky part was to convert the structure containing the returned values into the tuple that is passed back to the Python interface, SnoPy.

I used a the Python extension calls to

Here is that code.


%typemap(python,out) struct ret_s * {
  int len_str, len_no, i;
  int ret_code;
  int kk;
  struct ret_s {
	int success;
	int no_strings;
	int no_cur;
	int replacement;
	struct {
	  char s[1024];
	} sub[10];
	int cur[6];
  };
  struct ret_s *ret;
  PyObject * list_str;
  PyObject * list_no;
  ret = (struct ret_s *) $source;
  $target = PyTuple_New(3);  // The tuple has three elements.
  if (ret->success == 0) {
  }
  len_str = ret->no_strings; // The number of strings to be returned.
  if (ret->replacement) {
    if (len_str < 0)
      len_str = 0;
  }
  // Build a list of the returned strings.
  list_str = PyList_New(len_str+1); // Create the list of strings.
	for (i = 0; i < len_str+1; i++) { // Move each string into the list.
	  PyList_SetItem(list_str,i,PyString_FromString(ret->sub[i].s));
	}
  len_no = ret->no_cur;
  list_no = PyList_New(len_no+1);
  if (len_no) {
	for (i = 0; i < len_no+1; i++) {
	  PyList_SetItem(list_no,i,PyInt_FromLong(ret->cur[i]));
	}
  }
  // Make item 0 success or failure of the match.
  PyTuple_SetItem($target, 0, PyInt_FromLong(ret->success));
  // Make item 1 the string list in the tuple.
  PyTuple_SetItem($target, 1, list_str);
  // Make item 2
  PyTuple_SetItem($target, 2, list_no);
}

The C Program

Essentially, I defined a set of operations for creating patterns and use a C program to convert a pattern string into an array of those operations. Flex and bison are used to generate the that program. The input files are lex.1 and snobol.y. The operation array is passed to parse which interprets the operations and generates the pattern. I keep looking for a function in the Ada SPITBOL.PATTERNS package that will convert a string constant having the form of a pattern into a pattern but I haven't found one.

The program is has 5 entry points: createpattern, match, matchi, matchr, and matchri. createpattern, match, and matchr are called with a pattern string. For those calls, the pattern string is parsed to create the pattern array which is passed to parse. The other two entry points are called with an index which is just passed to parse.

Each of the calls to the C program generates the parameters to be passed to the Ada program. The most important step in doing that for the calls supplying a pattern string is to pass that string to yyparse which parses the string and build the array of parse generation operations.

The correspondence between the C calls and the Ada functions is given by


   pragma Export(C, MatchPatt, "matchpatt");
   pragma Export(C, MatchPattRepl, "matchrepl");
   pragma Export(C, CreatePattern, "createpatt");
   pragma Export(C, MatchPattIndex, "matchpattindex");
   pragma Export(C, MatchPattReplIndex, "matchreplindex");

Following the call to the Ada program, the C interface merely returns.

Essentially the same thing is done for the createpattern call except that the parameter list is simpler and only an indication of success or failure is returned.

Parse

This is the Ada stuff that I had to write to

This program was a challenge for me because it is the only Ada I have ever written and represents the only exposure that I have had to Ada. I am sure that it is not good Ada and I would welcome suggestions for improvement. The more specific the better.