Pattern matching syntax

Discussion:

(too old to reply)

Florian Weimer

2011-02-26 16:11:23 UTC

There is a well-known syntactic ambiguity in pattern matching: it is
unclear where identifier denotes a data constructor, a bound
variable, or a free variable. This is a problem if a datatype is
changed because exhaustiveness checking does not reliably identify all
places which need updating.

I know about several different solutions to this problem.

The SML way: An identifier which is has been bound as a data
constructor cannot be used in a pattern anymore. This is a non-local
property which cannot be inferred without analyzing all dependencies
(particularly due to the "open" construct). No attempt is made at
reducing ambiguities. Some implementations check for redundant
patterns, but this warning does not fire in all relevant cases.

The Erlang way: Erlang hasn't got datatypes, so this aspect does not
arise. However, it is possible to parametrize the pattern by
referencing bound variables. The submatch only succeeds if the
subvalue matches the value of the variable. This is a localized
property which can be statically inferred by looking at one file in
isolation in Erlang because variables do not leak across top-level
function boundaries.

The Haskell and Caml way: Data constructors and variable identifiers
occupy distinct lexical categories, and patterns always introduce new
bindings.

The Ada way: Do not implement pattern matching at all, and provide a
syntactically distinct alternative for the default case.

The shell way: Prefix all variable names with "$".

Creating syntactically distinct categories seems to be the best
approach. However, it does not work if you're introducing pattern
matching to an existing language where variable identifiers cannot be
restricted conveniently.

One approach is absent from the list above: syntax in the pattern (and
only in the pattern) to indicate whether an identifier introduces a
new variable, or refers to an existing data constructor (or perhaps
variable). ML-like languages introduce variables through pattern
matching, so this is not substantially different from the Haskell or
shell approach. But is there a non-ML language where something like
this has been implemented?

Jon Harrop

2011-02-26 22:30:50 UTC

Permalink

Yes. Mathematica does it. Variables are written:

x

whereas patterns are written:

x_

There are also syntaxes for matching sequences of one or more elements (x__)
and zero or more elements (x___). This is of particular value in Mathematica
because, as a CAS, you need to be able to match over the variable "x"
verbatim on the lhs as well as being able to match a value and refer back to
it as "x" on the rhs.

For example, the pattern:

x_ + y_

matches any expression that is the addition of two subexpressions that are
then bound to the symbols "x" and "y". In contrast:

x + y

matches the expression that is the addition of the symbol "x" to the symbol
"y".

Cheers,
Jon.

Post by Florian Weimer
There is a well-known syntactic ambiguity in pattern matching: it is
unclear where identifier denotes a data constructor, a bound
variable, or a free variable. This is a problem if a datatype is
changed because exhaustiveness checking does not reliably identify all
places which need updating.
I know about several different solutions to this problem.
The SML way: An identifier which is has been bound as a data
constructor cannot be used in a pattern anymore. This is a non-local
property which cannot be inferred without analyzing all dependencies
(particularly due to the "open" construct). No attempt is made at
reducing ambiguities. Some implementations check for redundant
patterns, but this warning does not fire in all relevant cases.
The Erlang way: Erlang hasn't got datatypes, so this aspect does not
arise. However, it is possible to parametrize the pattern by
referencing bound variables. The submatch only succeeds if the
subvalue matches the value of the variable. This is a localized
property which can be statically inferred by looking at one file in
isolation in Erlang because variables do not leak across top-level
function boundaries.
The Haskell and Caml way: Data constructors and variable identifiers
occupy distinct lexical categories, and patterns always introduce new
bindings.
The Ada way: Do not implement pattern matching at all, and provide a
syntactically distinct alternative for the default case.
The shell way: Prefix all variable names with "$".
Creating syntactically distinct categories seems to be the best
approach. However, it does not work if you're introducing pattern
matching to an existing language where variable identifiers cannot be
restricted conveniently.
One approach is absent from the list above: syntax in the pattern (and
only in the pattern) to indicate whether an identifier introduces a
new variable, or refers to an existing data constructor (or perhaps
variable). ML-like languages introduce variables through pattern
matching, so this is not substantially different from the Haskell or
shell approach. But is there a non-ML language where something like
this has been implemented?

Torben Ægidius Mogensen

2011-02-28 09:11:14 UTC

Permalink

Strictly speaking, you can use it in a pattern, but only as a
constructor.

Post by Florian Weimer
This is a non-local property which cannot be inferred without
analyzing all dependencies (particularly due to the "open" construct).
No attempt is made at reducing ambiguities. Some implementations
check for redundant patterns, but this warning does not fire in all
relevant cases.

I think this is one of the few design flaws in Standard ML, which is
otherwise one of the best thought-out languages I know.

Post by Florian Weimer
The Haskell and Caml way: Data constructors and variable identifiers
occupy distinct lexical categories, and patterns always introduce new
bindings.
Creating syntactically distinct categories seems to be the best
approach. However, it does not work if you're introducing pattern
matching to an existing language where variable identifiers cannot be
restricted conveniently.

If a constructor is applied to arguments in a pattern, it is easy to see
that it can not be a variable. So one solution would be to drop nullary
constructors, so you instead of NIL would write NIL (), both in patterns
and expressions. This is, however, rather verbose, so you could add a

Post by Florian Weimer
One approach is absent from the list above: syntax in the pattern (and
only in the pattern) to indicate whether an identifier introduces a
new variable, or refers to an existing data constructor (or perhaps
variable). ML-like languages introduce variables through pattern
matching, so this is not substantially different from the Haskell or
shell approach. But is there a non-ML language where something like
this has been implemented?

I have considered the same: Maybe all binding occurences of names should
be prefixed with a lambda (written as \ in ASCII). Basically, a
function definition would have the syntax

FunDef -> Match
| Match '|' FunDef

Match -> Pat '=>' Exp

Pat -> '\' Id
| Id
| Id Pat
| '(' Pats ')'

Pats -> Pat
| Pat ',' Pats

An Id with no arguments can, however, still be either a nullary
constructor or a variable. But a nullary constructor is in this context
not really different from a variable: In a pattern, you compare the
value to the value of the variable or constructor and in an expression,
you return the value of the variable or constructor. Basically, nullary
constructors are small integers with special types that restricts the
operations on values to comparison.

The main disadvantage of this is that simple function definitions become
more complicated:

fib \n = if n<2 then n else fib (n-1) + fib (n-2)

Also, how would you distinguish this from a definition of the form
Pat = Exp? You can prefix definitions with 'fun' or 'var' as in SML,
but that is a bit verbose. But since fib is bound here, we should maybe
write

\fib \n = if n<2 then n else fib (n-1) + fib (n-2)

as a shorthand for

\fib = \n => if n<2 then n else fib (n-1) + fib (n-2)

But that becomes a bit cryptic. Maybe it is better to prefix/postfix
constructors (and non-binding variables) in patterns with special
symbols instead, so simple function definitions look "normal".

Torben

Lauri Alanko

2011-03-05 19:37:18 UTC

Permalink

Post by Florian Weimer
There is a well-known syntactic ambiguity in pattern matching: it is
unclear where identifier denotes a data constructor, a bound
variable, or a free variable.
One approach is absent from the list above: syntax in the pattern (and
only in the pattern) to indicate whether an identifier introduces a
new variable, or refers to an existing data constructor (or perhaps
variable). ML-like languages introduce variables through pattern
matching, so this is not substantially different from the Haskell or
shell approach. But is there a non-ML language where something like
this has been implemented?

The pattern-matching engines in various schemes, e.g.
<http://docs.racket-lang.org/reference/match.html>.

There are two classes of patterns. In one, a variable-binding pattern
is simply x and a constructor pattern (a symbol, in practice) is 'x.
The other class is quasipatterns, where a variable-binding pattern is
,x and a symbol is a plain x.

Lauri