Discussion:
F# vs OCaml vs Python vs Haskell: hash table performance
Jon Harrop
2009-04-04 10:48:10 UTC
Permalink
Interesting discovery:

http://flyingfrogblog.blogspot.com/2009/04/f-vs-ocaml-vs-haskell-hash-table.html

Haskell's hash table implementation is even slower than Python's and 32x
slower than .NET's.

Apparently the idiomatic workaround is to use immutable trees that are still
an order of magnitude slower than a decent hash table implementation.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Casey Hawthorne
2009-04-04 18:55:37 UTC
Permalink
Post by Jon Harrop
http://flyingfrogblog.blogspot.com/2009/04/f-vs-ocaml-vs-haskell-hash-table.html
Haskell's hash table implementation is even slower than Python's and 32x
slower than .NET's.
Apparently the idiomatic workaround is to use immutable trees that are still
an order of magnitude slower than a decent hash table implementation.
Is there a way for the hash table to have a mutable array as a "spine"
by using a monad?
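
For concreteness, a rough sketch of what such a mutable-spine table could
look like, using STArray from Data.Array.ST with chained buckets and a
trivial `mod` hash. This is illustrative only: the names are made up and
it is not GHC's bundled Data.HashTable.

import Control.Monad.ST (ST, runST)
import Data.Array.ST (STArray, newArray, readArray, writeArray)

-- A fixed number of buckets; each bucket is an association list.
type Table s v = STArray s Int [(Int, v)]

newTable :: Int -> ST s (Table s v)
newTable size = newArray (0, size - 1) []

insert :: Int -> Table s v -> Int -> v -> ST s ()
insert size t k v = do
  let i = k `mod` size
  bucket <- readArray t i
  writeArray t i ((k, v) : bucket)

find :: Int -> Table s v -> Int -> ST s (Maybe v)
find size t k = do
  bucket <- readArray t (k `mod` size)
  return (lookup k bucket)

-- The mutation never escapes: runST hands back a pure result.
example :: Maybe Int
example = runST (do
  let size = 1024
  t <- newTable size
  mapM_ (\k -> insert size t k (k * k)) [0 .. 4095]
  find size t 71)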

--
Regards,
Casey
Jon Harrop
2009-04-05 04:55:28 UTC
Permalink
Post by Casey Hawthorne
Post by Jon Harrop
Haskell's hash table implementation is even slower than Python's and 32x
slower than .NET's.
Is there a way for the hash table to have a mutable array as a "spine"
by using a monad?
Apparently this is a performance bug in the GHC garbage collector, so I do
not believe that would help address this problem.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
André Thieme
2009-04-09 20:51:49 UTC
Permalink
Post by Jon Harrop
http://flyingfrogblog.blogspot.com/2009/04/f-vs-ocaml-vs-haskell-hash-table.html
Haskell's hash table implementation is even slower than Python's and 32x
slower than .NET's.
Apparently the idiomatic workaround is to use immutable trees that are still
an order of magnitude slower than a decent hash table implementation.
Isn't it a bit too general to compare languages, when in reality it's
the platforms that you are comparing?
The F# hash table performance is the C# one. And if Haskell were
implemented on .NET, then its performance would probably match the C#
one. So, .NET vs. GHC?


André
--
Lisp is not dead. It’s just the URL that has changed:
http://clojure.org/
Jon Harrop
2009-04-10 18:26:56 UTC
Permalink
Post by André Thieme
Isn't it a bit too general to compare languages, while it's in reality
the plattforms which you are comparing? The F# hash table performance is
the C# one.
That is true, yes.
Post by André Thieme
And if Haskell were implemented on .NET, then its performance would
probably match the C# one. So, .NET vs. GHC?
Haskell cannot be implemented on .NET with reasonable efficiency.

Microsoft employ most of the world's Haskell developers and they
collaborated in an attempt to get languages like Haskell running on .NET
but they ended up conceding in favour of an eager functional language (F#).

The important point is that these benchmark results demonstrate a core
inefficiency in the de facto standard Haskell implementation.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Paul Rubin
2009-04-10 23:52:10 UTC
Permalink
Haskell cannot be implemented on .NET with reasonable efficiency....
The important point is that these benchmark results demonstrate a core
inefficiency in the defacto-standard Haskell implementation.
I don't see why you describe that as a Haskell problem rather than a
.NET problem.
Jon Harrop
2009-04-11 12:41:33 UTC
Permalink
Post by Paul Rubin
Haskell cannot be implemented on .NET with reasonable efficiency....
The important point is that these benchmark results demonstrate a core
inefficiency in the defacto-standard Haskell implementation.
I don't see why you describe that as a Haskell problem rather than a
.NET problem.
Because it is Haskell's loss.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Manlio Perillo
2009-04-11 13:03:06 UTC
Permalink
Post by Jon Harrop
Post by Paul Rubin
Haskell cannot be implemented on .NET with reasonable efficiency....
The important point is that these benchmark results demonstrate a core
inefficiency in the defacto-standard Haskell implementation.
I don't see why you describe that as a Haskell problem rather than a
.NET problem.
Because it is Haskell's loss.
Well, but since the hash table you use in F# is written in C#, you should
*also* compare performance with a Haskell program using a hash table
written in C.



Regards Manlio
namekuseijin
2009-04-11 17:33:44 UTC
Permalink
Post by Jon Harrop
Post by Paul Rubin
Haskell cannot be implemented on .NET with reasonable efficiency....
The important point is that these benchmark results demonstrate a core
inefficiency in the defacto-standard Haskell implementation.
I don't see why you describe that as a Haskell problem rather than a
.NET problem.
Because it is Haskell's loss.
Yeah, because if it's not running on Microsoft's user-end platform it
doesn't exist, right?
Jon Harrop
2009-04-11 17:47:24 UTC
Permalink
Post by namekuseijin
Post by Jon Harrop
Post by Paul Rubin
Haskell cannot be implemented on .NET with reasonable efficiency....
The important point is that these benchmark results demonstrate a core
inefficiency in the defacto-standard Haskell implementation.
I don't see why you describe that as a Haskell problem rather than a
.NET problem.
Because it is Haskell's loss.
Ye, because it' not running on Microsoft's user-end platform it
doesn't exist, right?
That is why Haskell is eating F#'s dust, yes.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
namekuseijin
2009-04-12 03:12:39 UTC
Permalink
Post by Jon Harrop
Post by namekuseijin
Post by Jon Harrop
Post by Paul Rubin
Haskell cannot be implemented on .NET with reasonable efficiency....
The important point is that these benchmark results demonstrate a core
inefficiency in the defacto-standard Haskell implementation.
I don't see why you describe that as a Haskell problem rather than a
.NET problem.
Because it is Haskell's loss.
Ye, because it' not running on Microsoft's user-end platform it
doesn't exist, right?
That is why Haskell is eating F#'s dust, yes.
F# is not here to try out new ideas in PLT, it's a design using tested
and true ideas. It's still a heavily imperative design with ML
syntax, grabbing a few good ideas from OCaml as well, in typical
Microsoft "flattery" fashion. It's just another option for
selling .NET-based solutions.

Haskell is eating no dust from a language which has nothing new to
offer except performance from an imperative-based VM and framework and
official support from an IDE. I thought one of the ideas behind
Haskell and GHC was to try to achieve performance comparable to
imperative programs while remaining purely functional.
Jon Harrop
2009-04-12 11:41:51 UTC
Permalink
Post by namekuseijin
Post by Jon Harrop
Post by namekuseijin
Ye, because it' not running on Microsoft's user-end platform it
doesn't exist, right?
That is why Haskell is eating F#'s dust, yes.
F# is not here to try out new ideas in PLT, it's a design using tested
and true ideas.
Units of measure? Active patterns?
Post by namekuseijin
It's still a heavily imperative design...
A good thing.
Post by namekuseijin
with ML syntax and grabbing a few good ideas from OCaml as well,
Another good thing.
Post by namekuseijin
in typical Microsoft "flattery" fashion. It's just another option for
selling .NET-based solutions.
Whereas Haskell is not an option.
Post by namekuseijin
Haskell is eating no dust from a language which has nothing new to
offer except performance from an imperative-based VM and framework and
official support from an IDE.
Haskell is eating dust whether you like it or not.
Post by namekuseijin
I thought one of the ideas behind
Haskell and GHC was to try to achieve performance comparable to
imperative programs while remaining purely functional.
Another part of the failed experiment that was Haskell.

Haskell only appears to survive because a tiny number of people play the
system and make a lot of noise in order to make it look that way. 1,200
libraries in Hackage but no hash table? Don't make me laugh. Fourteen
5-star reviews on Amazon.com vs. near-universal condemnation on Reddit:

http://www.reddit.com/r/programming/comments/8btml/is_real_world_haskell_really_a_good_book/

"I came away from RWH still not able to make heads or tails of the
language."

"I must say I'm disappointed."

"It is ironical that this is exactly how not to do software development"

"i have a decent prior knowledge of functional programming and even i
found myself questioning the presentation of topics."

"I, too, think that RWH is probably overrated."

"It was rushed out to cash in on the Haskell fad"

Gee, I wonder how that happened?
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Bruce Stephens
2009-04-12 21:53:17 UTC
Permalink
Jon Harrop <***@ffconsultancy.com> writes:

[...]
1,200 libraries in Hackage but no hash table? Don't make me
laugh.
Sure, why not?

Maybe people developing with Haskell just don't find many uses for
hash tables?

Presumably it would be possible to implement a hash table in Haskell
(for example using a monadic wrapper around a conventional
implementation), but then using it would tend to infect lots of code
with that statefulness, which seems unlikely to be popular.
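
For what it's worth, that is roughly how the Data.HashTable module bundled
with GHC at the time is used: everything about it lives in IO, which is the
"statefulness" in question. The API below is recalled from the documentation
of that era, so treat the details as approximate.

import qualified Data.HashTable as H

main :: IO ()
main = do
  -- new takes an equality test and a hash function; the table is mutable
  -- and every operation on it is an IO action.
  t <- H.new (==) H.hashInt
  mapM_ (\k -> H.insert t k (show k)) [1 .. 10000 :: Int]
  v <- H.lookup t 4096
  print v   -- Just "4096"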

[...]
Jon Harrop
2009-04-13 01:39:12 UTC
Permalink
Post by Bruce Stephens
1,200 libraries in Hackage but no hash table? Don't make me
laugh.
Sure, why not?
Hash tables are one of the most important data structures and are the only
performant way to implement a dictionary in most cases.
Post by Bruce Stephens
Maybe people developing with Haskell just don't find many uses for
hash tables?
More likely: people don't find many uses for Haskell.
Post by Bruce Stephens
Presumably it would be possible to implement a hash table in Haskell
(for example using a monadic wrapper around a conventional
implementation), but then using it would tend to infect lots of code
with that statefulness, which seems unlikely to be popular.
Haskell's purity created that (huge) problem and they have since invented
many elaborate workarounds designed to get back to where impure languages
already are. If you look at the code for the current hash table
implementation it is not abhorrent.

After all, if it were an inherently bad idea they would presumably not have
bothered writing and bundling the (sucky) hash table they offer today.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Stephen J. Bevan
2009-04-15 02:23:44 UTC
Permalink
Post by Jon Harrop
Hash tables are one of the most important data structures and are the only
performant way to implement a dictionary in most cases.
Rather than argue "most" do you have some particular applications in
mind? One application of interest to me is how quickly one can add or
find a session in a firewall based on a key containing the classic
5-tuple :-

struct session_key {
    struct in_addr src_addr;
    struct in_addr dst_addr;
    unsigned short src_port;
    unsigned short dst_port;
    unsigned char proto;
};

To make the test somewhat realistic there would be, say, somewhere
between 1 and 8 million sessions, with the distribution not being
anything like random; e.g. it would be common to see > 80% of the keys
with a proto of 6, and 80 will be by far the most common dst_port.
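
In Haskell terms the key might be rendered something like the following;
the record layout and the FNV-style mixing function are just one
hypothetical encoding, not code from this thread.

import Data.Bits (xor)
import Data.Word (Word8, Word16, Word32)

data SessionKey = SessionKey
  { srcAddr :: Word32   -- in_addr as a 32-bit word
  , dstAddr :: Word32
  , srcPort :: Word16
  , dstPort :: Word16
  , proto   :: Word8
  } deriving (Eq, Ord, Show)

-- FNV-1-style mix of the five fields into a 32-bit hash.
hashSessionKey :: SessionKey -> Word32
hashSessionKey k =
  foldl (\h w -> (h * 16777619) `xor` w) 2166136261
        [ srcAddr k
        , dstAddr k
        , fromIntegral (srcPort k)
        , fromIntegral (dstPort k)
        , fromIntegral (proto k) ]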

A hash table is an obvious and thus common way of implementing this.
However, worst case can bite you badly if on collision you do a linear
probe to find a free bucket or a linear search over linked entries
unless you have an excellent hash function and you can afford to make
the hash table have approximately the same number of slots as the
maximum you expect to put in there. An alternative to a fixed size
table is to grow the table to keep the desired ratio but that will
require re-hashing and locking the table to re-hash a million entries
is not really feasible if your firewall is receiving say a million
packets a second. Incremental resize is possible but it means during
the resize you need to do two probes. Sticking with a fixed size
table but trying to improve the worst case one might be tempted to use
a balanced tree (e.g. AVL) instead of a list to handle collisions.
However, once a balanced tree is introduced the buckets start to look
less useful from a big-O perspective, and whether they are actually
useful can only be determined by testing the specific application.

So, in summary I'm not convinced that hash tables are one of the most
important data structures or the only performant way to implement a
dictionary. I would be convinced by code for the above problem which
showed a hash table is the best.
Paul Rubin
2009-04-15 06:28:44 UTC
Permalink
Post by Stephen J. Bevan
A hash table is an obvious and thus common way of implementing this.
However, worst case can bite you badly if on collision you do a linear
probe to find a free bucket or a linear search over linked entries
unless you have an excellent hash function and you can afford to make
the hash table have approximately the same number of slots as the
maximum you expect to put in there.
Making the hash table bigger won't help you in an application like
that if the queries are carefully chosen to cause collisions. The
average case doesn't matter; it's the worst case that kills you.

See: http://www.cs.rice.edu/~scrosby/hash/
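
To make that concrete: with a chained table and a trivial hash such as
`mod` over the bucket count, an attacker only has to send keys that are
multiples of the bucket count and every entry lands in the same bucket,
degrading each operation to a linear scan. A small hypothetical
illustration (the 1024-bucket table and key stream are invented for the
example, not taken from the paper above):

-- All adversarial keys collide into bucket 0 under a `mod`-based hash.
buckets :: Int
buckets = 1024

bucketOf :: Int -> Int
bucketOf k = k `mod` buckets

main :: IO ()
main = do
  let adversarial = take 100000 [buckets, 2 * buckets ..]
  -- Every one of the 100000 keys maps to bucket 0.
  print (length (filter (== 0) (map bucketOf adversarial)))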
Adrian Hey
2009-04-16 15:57:53 UTC
Permalink
Hello Stephen
Post by Stephen J. Bevan
So, in summary I'm not convinced that hash tables are one of the most
important data structures or the only performant way to implement a
dictionary. I would be convinced by code for the above problem which
showed a hash table is the best.
This has been a matter of some interest to me over the years. I'm also
sceptical about hash tables in general.

Here's my 2p (a few personal conclusions/opinions)

Tries of one kind or another are the best option for non-trivial keys
(No. of key bits >> machine word size).

Balanced binary trees become attractive for trivial keys (No. of key
bits <= machine word size).

AVL trees are the best performing balanced binary trees I know of.
Unfortunately they are not the simplest and the code needed to implement
them efficiently can be somewhat mind boggling.

In "purely functional" (I.E. immutable) implementations, most naive
Tries suffer performance problems because of the "deep tree" issue
(there are lot of nodes between the root and leaves, hence high heap
burn rate for trivial updates). This can be mitigated by using less
naive Trie implementations (e.g. use of PATRICIA Tries), though these
don't solve the problem completely (for sufficiently "pathalogical" key
sets). For mutable implementations I guess this is not a problem.

I suspect that in practice AVL trees (vs. Tries) are the best option for
keys that are somewhat larger than 1 machine word, exactly where AVL
trees start losing out to Tries is something to be determined by
experiment (depends on a lot of implementation details).

For keys that can easily be serialised into a list of machine words
(such as your session_key example) a good data structure IMO would be a
PATRICIA Trie, using an AVL tree (of machine word/sub-trie pairs) at
each Trie branch point. I suspect you would be invulnerable to the
pathological key set problem as all your keys would be the same length
(or something like that :-)

That said, the Haskell Data.IntMap actually uses a PATRICIA Trie for
machine word sized keys. This seems to be somewhat slower than an AVL
tree (specialised using unboxed Ints) for lookup but about the same
or faster for other operations. So it may not be a bad choice either.
http://www.haskell.org/pipermail/libraries/2005-October/004518.html
The really good thing about the Data.IntMap implementation is that
it's *much* simpler than the AVL code. Also, I can't help suspecting
that the relatively poor lookup performance of IntMap (vs. AVL) may be
down to ghc not doing a particularly good job with all the bit fiddling
that IntMap requires (maybe that's better these days).
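
For anyone who hasn't used it, Data.IntMap is driven like any other
persistent map; the PATRICIA structure is invisible at the API level.
A trivial usage sketch (the numbers are arbitrary):

import qualified Data.IntMap as IM

-- Build a map from Int keys to Doubles and query it.
demo :: (Bool, Double)
demo =
  let m = foldr (\k acc -> IM.insert k (fromIntegral k / 2) acc)
                IM.empty
                [1 .. 1000 :: Int]
  in (IM.member 71 m, IM.findWithDefault 0 500 m)
  -- (True, 250.0)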

I believe the Clojure folks have made use of an interesting data
structure that's probably worth a look too:
http://lamp.epfl.ch/papers/idealhashtrees.pdf

But the real problem with all these different "efficient" data
structures is that it's so hard to do really objective benchmarking of
them (doing a decent implementation of just one of them is a non-trivial
exercise). So I guess all efficiency claims have to be taken with a
pinch of salt.

Regards
--
Adrian Hey
Jon Harrop
2009-04-18 20:41:32 UTC
Permalink
Post by Stephen J. Bevan
Post by Jon Harrop
Hash tables are one of the most important data structures and are the
only performant way to implement a dictionary in most cases.
Rather than argue "most" do you have some particular applications in
mind? One application of interest to me is how quickly one can add or
find a session in a firewall based on a key containing the classic
5-tuple :-
struct session_key {
struct in_addr src_addr;
struct in_addr dst_addr;
unsigned short src_port;
unsigned short dst_port;
unsigned char proto;
};
Interesting.
Post by Stephen J. Bevan
To make the test somewhat realistic there would be say somewhere
between 1 and 8 million sessions with the distribution not being
anything like random e.g. it would be common to see > 80% of the keys
with a proto of 6, and 80 will be by far the most common dst_port.
A hash table is an obvious and thus common way of implementing this.
Sounds like a good first approximation to me. :-)
Post by Stephen J. Bevan
However, worst case can bite you badly if on collision you do a linear
probe to find a free bucket or a linear search over linked entries
unless you have an excellent hash function and you can afford to make
the hash table have approximately the same number of slots as the
maximum you expect to put in there.
As an aside, the GC in HLVM uses a hash table during the mark phase. I
discovered that it is substantially faster to use a (surprisingly) small
array of buckets, where each bucket is an array holding up to 500 elements
due to deliberate collisions. I believe this is because linear search of a
contiguous bucket is fast compared to fetching another bucket, as it is
far more cache coherent.

So I would certainly represent buckets as arrays and not lists (or trees).
Post by Stephen J. Bevan
An alternative to a fixed size
table is to grow the table to keep the desired ratio but that will
require re-hashing and locking the table to re-hash a million entries
is not really feasible if your firewall is receiving say a million
packets a second.
You want soft real-time performance for insertion. Certainly not something
hash tables are traditionally good at, I agree.
Post by Stephen J. Bevan
Incremental resize is possible but it means during
the resize you need to do two probes.
That should still be a *lot* faster than a tree, in any language.
Post by Stephen J. Bevan
Sticking with a fixed size
table but trying to improve the worst case one might be tempted to use
a balanced tree (e.g. AVL) instead of a list to handle collisions.
However, once a balanced tree is introduced the buckets start to look
less useful from big-O perspective and whether they are actually
useful can only be determined by testing the specific application.
I think you're barking up the wrong tree here.

Your keys are easily made into short sequences of relatively small ints, so
I'd use a trie and probably start with each sub-dictionary in the trie
being a hash table to save space. The point is that you have now replaced
one large hash table, where insertions could have O(n) cost, with many
smaller hash tables where "n" is effectively bounded to only a few hundred
and resizes will take only ~10us.
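
As a rough illustration of that shape (not code from this thread;
Data.IntMap stands in for the small per-node dictionaries, and a small
mutable hash table per node would have the same structure):

import qualified Data.IntMap as IM
import Data.Word (Word8)

-- A trie keyed byte-wise: each node maps the next key byte to a child.
-- Assumes all keys have the same length, as the session keys above do.
data Trie v = Leaf v
            | Node (IM.IntMap (Trie v))

empty :: Trie v
empty = Node IM.empty

insert :: [Word8] -> v -> Trie v -> Trie v
insert []     v _           = Leaf v
insert (b:bs) v (Node kids) =
  let i     = fromIntegral b
      child = IM.findWithDefault empty i kids
  in  Node (IM.insert i (insert bs v child) kids)
insert bs     v (Leaf _)    = insert bs v empty

find :: [Word8] -> Trie v -> Maybe v
find []     (Leaf v)    = Just v
find (b:bs) (Node kids) = find bs =<< IM.lookup (fromIntegral b) kids
find _      _           = Nothing

-- find [1,2,3] (insert [1,2,3] "x" empty) == Just "x"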

You could improve that further at the expense of memory consumption by
replacing the hash tables with arrays where possible, e.g. when looking up
the bytes in an IP address. That eliminates the risk of collisions
entirely.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Paul Rubin
2009-04-18 21:10:40 UTC
Permalink
Post by Jon Harrop
You could improve that further at the expense of memory consumption by
replacing the hash tables with arrays where possible, e.g. when looking up
the bytes in an IP address. That eliminates the risk of collisions
entirely.
That lets the other guy exhaust your memory by spoofing IP addresses
that force the creation of too many arrays. It gets even worse with
128-bit IPv6 addresses.
Jon Harrop
2009-04-19 13:35:39 UTC
Permalink
Post by Paul Rubin
Post by Jon Harrop
You could improve that further at the expense of memory consumption by
replacing the hash tables with arrays where possible, e.g. when looking
up the bytes in an IP address. That eliminates the risk of collisions
entirely.
That lets the other guy exhaust your memory by spoofing IP addresses
that force the creation of too many arrays. It gets even worse with
128-bit IPV6 addresses.
Yes. You get to choose how much memory he can exhaust though. The more
memory, the faster it gets.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Stephen J. Bevan
2009-04-19 02:27:43 UTC
Permalink
Post by Jon Harrop
Post by Stephen J. Bevan
Post by Jon Harrop
Hash tables are one of the most important data structures and are the
only performant way to implement a dictionary in most cases.
Rather than argue "most" do you have some particular applications in
mind? One application of interest to me is how quickly one can add or
find a session in a firewall ...
[snip]
Post by Jon Harrop
So I would certainly represent buckets as arrays and not lists (or trees).
Best or average case performance is not particularly interesting in a
firewall; the issue is worst-case performance. Thus I assume you
have a solution to ensure that the worst-case performance doesn't go to
hell if the array in a particular bucket is full.
Post by Jon Harrop
Post by Stephen J. Bevan
Incremental resize is possible but it means during
the resize you need to do two probes.
That should still be a *lot* faster than a tree, in any language.
Can I assume at this point you contend that a hash table will give the
best performance for this problem -- with or without having to deal
with the issue of re-sizing? Did you code it and measure the
performance or is this based on results using a hash table in other
problem domains, particularly ones where best/average case performance
is measured and not worst case performance?
Post by Jon Harrop
Your keys are easily made into short sequences of relatively-small
ints so I'd use a trie and probably start with each sub-dictionary
in the trie being a hash table to save space. ...
Again, did you code and test this or is this speculation?
Jon Harrop
2009-04-19 13:36:21 UTC
Permalink
Post by Stephen J. Bevan
Post by Jon Harrop
Post by Stephen J. Bevan
Post by Jon Harrop
Hash tables are one of the most important data structures and are the
only performant way to implement a dictionary in most cases.
Rather than argue "most" do you have some particular applications in
mind? One application of interest to me is how quickly one can add or
find a session in a firewall ...
[snip]
Post by Jon Harrop
So I would certainly represent buckets as arrays and not lists (or trees).
Best or average case performance is not particularly interesting in a
firewall, the issue is worst case performance. Thus I assume you
have a solution to ensure that the worst case performance doesn't go to
hell if the array in a particular bucket is full.
You can just double the size of the bucket and copy the elements. That is
one allocation and a handful of copied elements.

However, in your application there will probably never be any collisions
using the approach I described.
Post by Stephen J. Bevan
Post by Jon Harrop
Post by Stephen J. Bevan
Incremental resize is possible but it means during
the resize you need to do two probes.
That should still be a *lot* faster than a tree, in any language.
Can I assume at this point you contend that a hash table will give the
best performance for this problem -- with or without having to deal
with the issue of re-sizing. Did you code it and measure the
performance or is this based on results using a hash table in other
problem domains, particularly ones where best/average case performance
is measured and not worst case performance?
The relevant performance measurements are already graphed in both OCaml for
Scientists (page 80) and F# for Scientists (page 88).
Post by Stephen J. Bevan
Post by Jon Harrop
Your keys are easily made into short sequences of relatively-small
ints so I'd use a trie and probably start with each sub-dictionary
in the trie being a hash table to save space. ...
Again, did you code and test this or is this speculation?
It is based upon real performance measurements.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Stephen J. Bevan
2009-04-19 14:29:38 UTC
Permalink
Post by Jon Harrop
However, in your application there will probably never be any collisions
using the approach I described.
Probably is not good enough. Either there are no collisions, in which
case it is fine, or there are, in which case it must be possible to
quantify the worst case.
Post by Jon Harrop
The relevant performance measurements are already graphed in both OCaml for
Scientists (page 80) and F# for Scientists (page 88).
I don't use Ocaml or F# so books about them are not high on my reading
list. I can however run O'Caml code, so is the code available for
download so that the results can be verified?
Jon Harrop
2009-04-19 20:30:13 UTC
Permalink
Post by Stephen J. Bevan
Post by Jon Harrop
However, in your application there will probably never be any collisions
using the approach I described.
Probably is not good enough. Either there are not in which case it is
fine or there are in which case it must be possible to quantify the
worst case.
The worst case is bounded so the only useful quantification would be actual
performance measurements. Suffice to say, you can alter the bounded size of
the keys to obtain any trade-off between hash table lookup with no
collisions and linear search of an array.
Post by Stephen J. Bevan
Post by Jon Harrop
The relevant performance measurements are already graphed in both OCaml
for Scientists (page 80) and F# for Scientists (page 88).
I don't use Ocaml or F# so books about them are not high on my reading
list. I can however run O'Caml code, so is the code available for
download so that the results can be verified?
The code is not available. However, if you are not using either of those
languages (or .NET) then the performance characteristics of their hash
table implementations is irrelevant. You can recreate the benefits in
almost any language though.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Stephen J. Bevan
2009-04-19 21:36:01 UTC
Permalink
Post by Jon Harrop
The worst case is bounded so the only useful quantification would be actual
performance measurements. Suffice to say, you can alter the bounded size of
the keys to obtain any trade-off between hash table lookup with no
collisions and linear search of an array.
I agree actual performance measurements are the most useful measure;
that's why I wrote "I would be convinced by code for the above problem
which showed a hash table is the best." So far, the closest we've got
to code is :-
Post by Jon Harrop
Post by Jon Harrop
The relevant performance measurements are already graphed in both OCaml
for Scientists (page 80) and F# for Scientists (page 88).
The code is not available. However, if you are not using either of those
languages (or .NET) then the performance characteristics of their hash
table implementations is irrelevant.
They are relevant because when I asked if you'd concluded that the
_worst-case_ performance of a hash table "should be a *lot* faster
than a tree" by actually measuring the performance you cited these
measurements. So either they show this or they do not. The fact that
I don't use O'Caml is irrelevant. I have it installed and will
happily re-run your tests to confirm/deny your findings. Given that
your book is targeted at scientists I would have thought you would
welcome third-party verification since that's a cornerstone of
science.
Post by Jon Harrop
You can recreate the benefits in almost any language though.
If there are indeed benefits, that is the assertion I'm questioning
since I've yet to see any evidence for it.
Jon Harrop
2009-04-19 23:13:29 UTC
Permalink
Post by Stephen J. Bevan
Post by Jon Harrop
The code is not available. However, if you are not using either of those
languages (or .NET) then the performance characteristics of their hash
table implementations is irrelevant.
They are relevant because when I asked if you'd concluded that the
_worst-case_ performance of a hash table "should be a *lot* faster
than a tree" by actually measuring the performance you cited these
measurements. So either they show this or they do not. The fact that
I don't use O'Caml is irrelevant. I have it installed and will
happily re-run your tests to confirm/deny your findings. Given that
your book is targeted at scientists I would have thought you would
welcome third-party verification since that's a cornerstone of
science.
Here is the benchmark code used to generate the results I cited:

let time = Unix.gettimeofday

(* Unused list helpers; not referenced by the benchmark below. *)
let list_init n f =
  let rec aux n l =
    if n=0 then l else
      aux (n-1) (f n::l)
  in
  aux n []

let rec list_extract n = function
    [] -> invalid_arg "list_extract"
  | h::t ->
    if n=0 then (h, t) else
      let (e, t) = list_extract (n-1) t in
      (e, h::t)

module Key =
struct
  type t = int
  let compare (x : int) y = if x<y then -1 else if x==y then 0 else 1
end

module Mapping = Map.Make(Key)

let array_map_inplace f a =
  for i=0 to (Array.length a) - 1 do
    a.(i) <- f a.(i)
  done

(* For n = 0..max_n, time the n-th insertion applied to each of [loops]
   dictionaries, subtract the cost of an identity pass over the same array,
   and write the mean per-insertion time in nanoseconds to [filename]. *)
let test empty filename insert =
  let loops = 256 and max_n = 4096 in

  let a = Array.init loops empty in

  let ch = open_out filename in
  output_string ch "{";
  for n=0 to max_n do
    print_string ((string_of_int n)^" ");
    flush stdout;
    let t = time () in
    let insert = insert n in
    array_map_inplace insert a;
    let t = (time ()) -. t in
    let t2 = time () in
    array_map_inplace (fun i -> i) a;
    let t2 = (time ()) -. t2 in
    let t = 1e9 *. (t -. t2) /. (float_of_int loops) in
    output_string ch ("{"^(string_of_int n)^", "^(string_of_float t)^"}");
    if n<>max_n then output_string ch ", ";
  done;
  output_string ch "}";
  close_out ch;
  Gc.compact ()

(* Keys are small ints in [0, 4095]; values are random floats. *)
let _ =
  let f n = (n*71) mod 4096 in
  test
    (fun _ -> Mapping.empty)
    "map_insert.dat"
    (fun n -> Mapping.add (f n) (Random.float 1.));
  test
    (fun _ -> Hashtbl.create 1)
    "hashtbl_insert.dat"
    (fun n m -> Hashtbl.add m (f n) (Random.float 1.); m)

The output is in Mathematica format.
Post by Stephen J. Bevan
Post by Jon Harrop
You can recreate the benefits in almost any language though.
If there are indeed benefits, that is the assertion I'm questioning
since I've yet to see any evidence for it.
Look at the evidence I already cited.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Paul Rubin
2009-04-19 23:35:08 UTC
Permalink
Post by Jon Harrop
(fun n -> Mapping.add (f n) (Random.float 1.));
If I understand properly, you are inserting a bunch of random keys.
That tests the average case, not the worst case. To test the worst
case, you have to analyze the hash function and generate a series of
keys that results in the maximum possible amount of collisions for
that particular hash function.
Jon Harrop
2009-04-20 19:12:57 UTC
Permalink
Post by Paul Rubin
Post by Jon Harrop
(fun n -> Mapping.add (f n) (Random.float 1.));
If I understand properly, you are inserting a bunch of random keys.
That tests the average case, not the worst case.
It tests one of the two kinds of worst case. Specifically, the worst case
O(n) insertion time that occurs when the hash table is resized.
Post by Paul Rubin
To test the worst
case, you have to analyze the hash function and generate a series of
keys that results in the maximum possible amount of collisions for
that particular hash function.
Collisions are the other worst case for hash tables. However, they are
irrelevant here because they are both bounded and controllable when using
the solution I described (a trie of hash tables).
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Jon Harrop
2009-04-20 20:25:38 UTC
Permalink
To test the worst case, you have to analyze the hash function...
Just to clarify: the hash function is the identity function and the keys are
small ints in this case.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Stephen J. Bevan
2009-04-20 00:39:32 UTC
Permalink
Post by Jon Harrop
Post by Stephen J. Bevan
Post by Jon Harrop
You can recreate the benefits in almost any language though.
If there are indeed benefits, that is the assertion I'm questioning
since I've yet to see any evidence for it.
Look at the evidence I already cited.
The only evidence you've cited is your book and now (finally) some
code. From a cursory examination of the code and the O'Caml
implementations of Hashtbl and Map, two things seem clear :-

1. This test does not measure the worst case performance of the hash
table implementation.

2. The test compares the performance of a non-persistent hashtable
against a persistent map. The latter clearly creates a lot more
garbage than the former and so puts additional stress on the
O'Caml GC. Whether the additional GC work skews the results is not
clear.

Despite these issues I'll try out the code and also convert it to C
and use a non-persistent map to eliminate any quality of GC issues
from the measurements.
Jon Harrop
2009-04-20 19:18:00 UTC
Permalink
Post by Stephen J. Bevan
Post by Jon Harrop
Post by Stephen J. Bevan
Post by Jon Harrop
You can recreate the benefits in almost any language though.
If there are indeed benefits, that is the assertion I'm questioning
since I've yet to see any evidence for it.
Look at the evidence I already cited.
The only evidence you've cited is your book and now (finally) some
code. From a cursory examination of the code and the O'Caml
implementations of Hashtable and Map two things seem clear :-
1. This test does not measure the worst case performance of the hash
table implementation.
The sole purpose of the test is to measure worst case insertion time of
trees and hash tables. Perhaps you are referring to collisions but they are
irrelevant here.
Post by Stephen J. Bevan
2. The test compares the performance of a non-persistent hashtable
against a persistent map. The latter clearly creates a lot more
garbage than the former and so puts additional stress on the
O'Caml GC. Whether the additional GC work skews the results is not
clear.
True. I have not tested against mutable trees.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Stephen J. Bevan
2009-04-20 05:25:37 UTC
Permalink
[snip]
Post by Jon Harrop
The output is in Mathematica format.
I don't have Mathematica and I don't have your book(s), so I don't know
what the graphs are supposed to look like; I therefore tweaked the output
so that I can use gnuplot. The changes are :-

$ diff -u harrop-orig.ml harrop.ml
--- harrop-orig.ml 2009-04-19 17:41:15.000000000 -0700
+++ harrop.ml 2009-04-19 21:26:20.000000000 -0700
@@ -33,10 +33,7 @@
let a = Array.init loops empty in

let ch = open_out filename in
- output_string ch "{";
for n=0 to max_n do
- print_string ((string_of_int n)^" ");
- flush stdout;
let t = time () in
let insert = insert n in
array_map_inplace insert a;
@@ -45,10 +42,8 @@
array_map_inplace (fun i -> i) a;
let t2 = (time ()) -. t2 in
let t = 1e9 *. (t -. t2) /. (float_of_int loops) in
- output_string ch ("{"^(string_of_int n)^", "^(string_of_float t)^"}");
- if n<>max_n then output_string ch ", ";
+ output_string ch ((string_of_int n)^" "^(string_of_float t)^"\n");
done;
- output_string ch "}";
close_out ch;
Gc.compact ()

I didn't want to install O'Caml from scratch so I'm using the version
that comes with Ubuntu :-

$ ocamlopt -v
The Objective Caml native-code compiler, version 3.10.2
Standard library directory: /usr/lib/ocaml/3.10.2

If that has some known performance problem or otherwise warps the test
results then please indicate which version I should use.

$ ocamlopt -o harrop unix.cmxa harrop.ml
$ ./harrop

A simple linear plot shows :-

$ cat plot
set terminal dumb
plot "hashtbl_insert.dat"
plot "map_insert.dat"
$ gnuplot plot

2e+06 ++-----+-------+------+------+-------+------+------+-------+-----++
+ + + + + +"hashtbl_insert.dat" +AA +
1.8e+06 ++ ++
| |
1.6e+06 ++ ++
1.4e+06 ++ ++
| |
1.2e+06 ++ ++
| |
1e+06 ++ ++
| |
800000 ++ A ++
| |
600000 ++ ++
| |
400000 ++ A ++
200000 ++ ++
| A |
0 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ++
+ + + + + + + + + +
-200000 ++-----+-------+------+------+-------+------+------+-------+-----++
0 500 1000 1500 2000 2500 3000 3500 4000 4500


16000 ++------+------+-------+------+-------+------+-------+------+------++
+ + + + + + "map_insert.dat"+ AA +
14000 ++ A A A AA ++
| AAA AA AAAA AA |
12000 ++ A AA AA A A AAAAAAAA AAAAAAA AA ++
10000 ++ AAAAAAAAAAA AAAAAAAAAAAAA AAAA A AAAAAAA ++
| AAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAA |
8000 ++ AAAAAAAAAAAAAAAA AAA A AAA A A ++
| A AAAAAAAAAAAA AAAA A A A A A AA |
6000 ++A AAAAAAAAA AA A AA AA AA A A A A AA AAAAAA AA AAAAA ++
|AAA AAAAAAAAAA AAAAAAAAAAAAAAAAAAAAA AAA AAAA AAAAAAAAA AAAAA |
4000 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAA AAAAAAAA A AAA AA AAA ++
AAAAAAAAAAA A A A AA A AA AAAA A A AA AAA AAAAA |
2000 AAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ++
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AA AAAAAAAA A A A |
0 ++ ++
-2000 ++ ++
| A |
-4000 ++ A ++
+ + + + + + + + + +
-6000 ++------+------+-------+------+-------+------+-------+------+------++
0 500 1000 1500 2000 2500 3000 3500 4000 4500

Some things to note from the graphs :-

1. Both runs contain negative numbers, which seems to cast doubt on the
accuracy of the results. An artefact of running on a 32-bit system
perhaps?

2. The maximum time for any of the map runs is < 16,000. The maximum
run time for the hash table is > 1,000,000.

3. There appears to be an exponential running through the hash table results.

Given the huge variation in times for the hash table, a logarithmic
y axis seems in order and to clearly see whether there is an
exponential in the hash table results let's go logarithmic in x as
well :-

$ cat plot
set terminal dumb
set logscale y
set logscale x
plot "hashtbl_insert.dat"
plot "map_insert.dat"
$ gnuplot plot

1e+07 ++---+--+-+-++++++----+--+-++-+++++----+-+--++-++++----+--+-+-++++++
+ + + "hashtbl_insert.dat" A +
| |
+ A +
1e+06 ++ A ++
+ +
| A |
+ A +
100000 ++ ++
+ A +
| A |
+ A +
10000 ++ A A AA AA ++
+ A AAA AA AAAAA A AAAAA AAA +
| A A A A A A AAAAAA A AAAAA AAAAA A |
+ A A A A A A +
1000 ++ ++
+ A AAAAAAAAAAAAA +
| A A AAAAAAAAAAAAAAAAAAAA |
A A A A AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +
100 ++---+--+-+-++++++----+--+-++-+++++----+-+--++-++++----+--+A+-++++++
1 10 100 1000 10000


100000 ++---+--+-+-++++++----+--+-++-+++++----+-+--++-++++----+--+-+-++++++
+ + + "map_insert.dat" A +
+ +
+ +
| |
+ +
| AAAAAAAAA |
10000 ++ AA AAAAAAAAAAAAA ++
+ AA AAAAAAAAAAAAAAAAAAAA +
+ A A AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +
+ A AA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A +
+ A A A A AAAAA A AA AAAAAAA AA A AAAAAAAAAAAA +
| A A A AA A AAAAAAAAAAAAA |
1000 ++ AAAAAAAAAAAAAAAAA ++
+ AAAAAAAAAAAAAAAAAA +
+ AAAAAAAAAAAAAA +
+ A AAAAA AAAAAAAAAAAAAAA AA A +
+ A A A AAAAAAA AAAAAAAAAAAAAAAA +
+ A AAA AA A A AAA +
A + + + +
100 ++---+--+-+-++++++----+--+-++-+++++----+-+--++-++++----+--+-+-++++++
1 10 100 1000 10000

That line right through the hash table results appears to confirm
that there is an exponential in the hash table performance.

So before I do any more tests, are these results consistent with what
you get? I'm assuming not since they clearly don't show hash table
performance in a good light.
s***@gmail.com
2009-04-20 19:53:29 UTC
Permalink
Post by Stephen J. Bevan
That line right through the hash table results appears to confirm
that there is an exponential in the hash table performance.
s/exponential/quadratic/
Jon Harrop
2009-04-20 20:40:32 UTC
Permalink
Post by Stephen J. Bevan
I didn't want to install O'Caml from scratch so I'm using the version
that comes with Ubuntu :-
$ ocamlopt -v
The Objective Caml native-code compiler, version 3.10.2
Standard library directory: /usr/lib/ocaml/3.10.2
If that has some know performance problem or otherwise warps the test
results then please indicate which version I should use.
That version should be fine.
Post by Stephen J. Bevan
Some things to note from the graphs :-
1. Both runs contain negative numbers which seems to cast doubt on the
accuracy of the results. An artefact of running on 32-bit system
perhaps?
Good catch. Your ints may well be overflowing.
Post by Stephen J. Bevan
2. The maximum time for any of the map runs is < 16,000. The maximum
run time for the hash table is > 1,000,000.
For large hash tables or maps, yes. That justifies your original assertion
that insertion time for large hash tables is often bad.

However, the method I proposed uses several small hash tables and the
relevant insertion time is under 100us in that case. I think that is
adequate for your application.
Post by Stephen J. Bevan
3. There appears to be quadratic running through the hash table results.
Should be O(n) worst case. Comparing the jumps at ~256 and ~4096 in OCaml I
get a 16x larger hash table and a 20x longer insertion time, which is roughly
linear (give or take cache coherence).
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
s***@gmail.com
2009-04-20 22:17:36 UTC
Permalink
Post by Jon Harrop
Post by Stephen J. Bevan
1. Both runs contain negative numbers which seems to cast doubt on the
   accuracy of the results.  An artifact of running on 32-bit system
   perhaps?
Good catch. Your ints may well be overflowing.
Urgh.
Post by Jon Harrop
Post by Stephen J. Bevan
2. The maximum time for any of the map runs is < 16,000.  The maximum
   run time for the hash table is > 1,000,000.
For large hash tables or maps, yes. That justifies your original assertion
that insertion time for large hash tables is often bad.
I'm interested in cases where there are 10^6 entries and I think we
can both agree that qualifies as large and thus a single hash table is
not a performant solution. However, would you care to put an upper
bound on "large" and thus on the size of problems that a single hash
table is good for?
Post by Jon Harrop
However, the method I proposed uses several small hash tables and the
relevant insertion time is under 100us in that case. I think that is
adequate for your application.
Once you have a trie of X, then for sufficiently small sizes of X the
worst-case performance is going to be consistent no matter what X
is; the question is how large one can make X without running into some
performance problem. I gave an example of an IPv4 tuple which is 13
bytes and so one _might_ consider 13 levels to keep each level down to
a maximum of 2^8 so that one might consider using an array (but this
is susceptible to DoS on memory usage) or a hash table (I assume 2^8
is not considered large and we can live with the upper bound of re-
hashing 2^8-1 elements when the last one is added). However the same
problem exists for an IPv6 tuple of 37 bytes and 37 levels looks
somewhat less appealing (make it ~40 if you also want to index by
ingress interface). For IPv4 one might consider using say 10 bits in
the trie key but that doesn't make a big dent in the number of levels,
even less so for IPv6.
Jon Harrop
2009-04-20 23:28:10 UTC
Permalink
Post by s***@gmail.com
Post by Jon Harrop
For large hash tables or maps, yes. That justifies your original
assertion that insertion time for large hash tables is often bad.
I'm interested in cases where there are 10^6 entries and I think we
can both agree that qualifies as large and thus a single hash table is
not a performant solution. However, would you care to put an upper
bound on "large" and thus on the size of problems that a single hash
table is good for?
Now that you are armed with real performance data you can choose an
acceptable trade-off. If you do it byte-wise then you have up to 256
elements in each hash table which gives stalls of only 0.1ms which is
probably acceptable.
Post by s***@gmail.com
Post by Jon Harrop
However, the method I proposed uses several small hash tables and the
relevant insertion time is under 100us in that case. I think that is
adequate for your application.
Once you have a trie of X then for sufficiently small sizes of X then
the worst case performance is going to be consistent no matter what X
is, the question is how large can one make X without running into some
performance problem? I gave an example of an IPv4 tuple which is 13
bytes and so one _might_ consider 13 levels to keep each level down to
a maximum of 2^8 so that one might consider using an array (but this
is susceptible to DoS on memory usage) or a hash table (I assume 2^8
is not considered large and we can live with the upper bound of re-
hashing 2^8-1 elements when the last one is added). However the same
problem exists for an IPv6 tuple of 37 bytes and 37 levels looks
somewhat less appealing (make it ~40 if you also want to index by
ingress interface).
Well, IPv6 gives you a *huge* number of keys and I don't believe there is
anything you can do to counteract that. My point was simply that a trie of
dictionaries should always be much faster than a binary trie, e.g. 37 vs
296 indirections for IPv6.
Post by s***@gmail.com
For IPv4 one might consider using say 10 bits in
the trie key but that doesn't make a big dent in the number of levels,
even less so for IPv6.
Sure. The average case performance will be slightly better and the worst
case performance will be slightly worse.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Stephen J. Bevan
2009-04-21 02:26:57 UTC
Permalink
Post by Jon Harrop
Now that you are armed with real performance data you can choose an
acceptable trade-off. If you do it byte-wise then you have up to 256
elements in each hash table which gives stalls of only 0.1ms which is
probably acceptable.
It is at the high end of what some customers expect the maximum
additional latency to be and means that everything else must be 0
time. So to be practical it really needs to be down at 0.01ms or
preferably 0.001ms.

I brought up the example to see if it would cause you to modify or
further qualify your claim about hash tables, or perhaps explicitly
invoke the "most" qualifier that is there. It did, in the sense that
you do not advocate a single hash table for this problem and instead
advocate a trie of hash tables. However, you apparently do not
want to be drawn on the question of exactly when a hash table becomes
too large and thus needs augmenting or replacing, even for your own
test program.
to the claim that "hash tables are the only performant way to
implement a dictionary in most cases." Regardless I think the running
example shows that caveat lector should apply to that claim.
Post by Jon Harrop
Well, IPv6 gives you a *huge* number of keys and I don't believe
there is anything you can do to counteract that. My point was simply
that a trie of dictionaries should always be much faster than a
binary trie, e.g. 37 vs 296 indirections for IPv6.
Indeed, but then are those the only two possible implementations? A
balanced tree containing 10^6 elements has a worst case of 20
comparisons. The best case cost of each comparison can be improved
somewhat if one pre-hashes to generate a 32- or 64-bit value from the
37-40 bytes so that many levels only involve a single 32/64-bit
comparison. Of course, sticking with worst-case performance, we
can ignore that and assume that we need to do the full key comparison
20 times. So the question becomes: can we do 20 full key comparisons
in less time than 37-40 probes, with whatever worst-case bound one
puts on the hash tables [1]? While the trie&hash _may_ win that one, I
for one could not call it either way without writing the code and
testing it since the winner may depend on which is most cache
friendly for the particular cache/size being used.

------------------------
[1] The rehash on re-size + additional probes on collision
until/unless one is willing to re-size until the table is the same
size as the trie key in which case it isn't a hash table anymore.
Jon Harrop
2009-04-21 04:46:55 UTC
Permalink
Post by Stephen J. Bevan
Post by Jon Harrop
Now that you are armed with real performance data you can choose an
acceptable trade-off. If you do it byte-wise then you have up to 256
elements in each hash table which gives stalls of only 0.1ms which is
probably acceptable.
It is at the high end of what some customers expect the maximum
additional latency to be and means that everything else must be 0
time. So to be practical it really needs to be down at 0.01ms or
preferably 0.001ms.
I think microsecond insertion time would be easier to achieve without hash
tables.
Post by Stephen J. Bevan
I brought up the example to to see if it would cause you modify or
further qualify your claim about hash tables or perhaps explicitly
invoke the "most" qualifier that is there.
I think it is fair to say that microsecond worst-case insertion time is not
a common requirement for a dictionary.
Post by Stephen J. Bevan
It did in the sense that
you do not advocate a single hash table for this problem and instead
you advocate a trie of hash tables. However, you apparently do not
want to be drawn on the question of exactly when a hash table become
too large and thus need augmenting or replacing, even for your own
test program.
That is a tautology. My test program is designed solely to empower
programmers to be able to choose when to make the trade-offs themselves
based upon real performance measurements. My books and articles are often
unusually rich in such content because I believe it is very valuable.
Post by Stephen J. Bevan
Consequently, I'm not clear on whether you are sticking
to the claim that "hash tables are the only performant way to
implement a dictionary in most cases." Regardless I think the running
example shows that caveat lector should apply to that claim.
Perhaps I can also qualify that keys are usually small constant-sized data
and the bottleneck operation for a dictionary is usually average-case
lookup and not worst-case insertion.
Post by Stephen J. Bevan
Post by Jon Harrop
Well, IPv6 gives you a *huge* number of keys and I don't believe
there is anything you can do to counteract that. My point was simply
that a trie of dictionaries should always be much faster than a
binary trie, e.g. 37 vs 296 indirections for IPv6.
Indeed, but then are those the only two possible implementations? A
balanced tree containing 10^6 elements has a worst case of 20
comparisons. The best case cost of each comparison can be improved
somewhat if one pre-hashes to generate a 32- or 64-bit value from the
37-40 bytes so that many levels only involve a single 32/64-bit
comparision.
Perhaps. Once you've loaded the first bit of your key then you've loaded a
whole cache line though. Furthermore, comparison is likely to terminate
early, particularly if you put the most randomized bytes first.
Post by Stephen J. Bevan
Of course sticking with worst-case performance then we
can ignore that and assume that we need to do the full key comparision
20 times.
You originally said there were up to 8 million live sessions, which is 23
comparisons.
Post by Stephen J. Bevan
So the question becomes can we do 20 full key comparisons
in less time than 37-40 probes with whatever worst case bound one
puts on the hash tables[1].
Sounds like the trade-off occurs between IPv4 and IPv6. :-)

For IPv6, I agree that a balanced tree looks best (better than any kind of
trie or hash table).
Post by Stephen J. Bevan
While the trie&hash _may_ win that one, I
for one could not call it either way without writing the code and
testing it since the winner may depend on which is most cache
friendly for the particular cache/size being used.
Agreed but I think these are very unusual requirements.
Post by Stephen J. Bevan
[1] The rehash on re-size + additional probes on collision
until/unless one is willing to re-size until the table is the same
size as the trie key in which case it isn't a hash table anymore.
I disagree.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Stephen J. Bevan
2009-04-21 14:40:34 UTC
Permalink
Post by Jon Harrop
I think it is fair to say that microsecond worst-case insertion time is not
a common requirement for a dictionary.
Maybe; I can only comment on the problems that I have to solve.
Post by Jon Harrop
That is a tautology. My test program is designed solely to empower
programmers to be able to choose when to make the trade-offs themselves
based upon real performance measurements. My books and articles are often
unusually rich in such content because I believe it is very valuable.
Your program only tests two structures, uses a simple key and uses a
relatively small number of entries. That doesn't mean the result
wasn't useful for that test, but I don't think it is empowering
programmers to draw any conclusions about hash tables in general from
that test unless it is clear that the test results scale up or there
is a qualifier indicating that they don't.
Post by Jon Harrop
Perhaps I can also qualify that keys are usually small constant-sized data
That would be a very good qualifier especially if small is quantified.
The simplest fixed-size key in any problem I face is 8 bytes (20 for
IPv6); keys are typically much longer and can be variable, e.g. the from
tag, to tag and branch tag in a SIP call are all variable-length
strings and collectively the minimum could be around 10 bytes but in
real calls would typically be upwards of 60 bytes.
Post by Jon Harrop
Perhaps. Once you've loaded the first bit of your key then you've loaded a
whole cache line though. Furthermore, comparison is likely to terminate
early, particularly if you put the most randomized bytes first.
That's the purpose of inserting the hash first, but that's a best-case
optimization; everything still has to work consistently for the worst
case, where the hacker sees the code and hits you with many packets
that hash to the same small set of values.
Post by Jon Harrop
You originally said there were up to 8 million live sessions, which is 23
comparisons.
I agree using the lower bound of 10^6 gives a slight advantage to the
tree, so let's be fair and use the upper bound of 8*10^6 and thus 23
comparisons. I don't think it will change the outcome.
Post by Jon Harrop
Post by Stephen J. Bevan
While the trie&hash _may_ win that one, I
for one could not call it either way without writing the code and
testing it since the winner may depend on which is most cache
friendly for the particular cache/size being used.
Agreed but I think these are very unusual requirements.
As noted above, we can only really speak to the problems we have to solve.
My problems may not be your problems or common when averaged over all
possible problems, but it is a domain whose services are being
utilized as part of this conversation (at least at my end).
Post by Jon Harrop
Post by Stephen J. Bevan
[1] The rehash on re-size + additional probes on collision
until/unless one is willing to re-size until the table is the same
size as the trie key in which case it isn't a hash table anymore.
I disagree.
Since you didn't elaborate I assume you don't want to argue the point.
Adrian Hey
2009-04-14 15:24:04 UTC
Permalink
Hello Jon,
Post by Jon Harrop
Post by namekuseijin
I thought one of the ideas behind
Haskell and GHC was to try to achieve performance comparable to
imperative programs while remaining purely functional.
Another part of the failed experiment that was Haskell.
Haskell is not a failed experiment! Even if all that's been achieved
is a demonstration of how not to design a "purely functional"
programming language, it's still a successful experiment :-)

Whilst I would agree (after 10 years or so of using the language) that
Haskell has many serious problems, I think you're missing the point
somewhat in your criticism. Sucky hash table libs, or map libs, or even
sucky performance generally are the least of these problems IMO (and
for the most part easily fixable).

I think the hash table implementation is known to suck, but nobody cares
because nobody wants to use it anyway. The purely functional Data.Map is
also known to suck, but the community is stuck with it because it's been
elevated to the status of (quasi-)standard by being bundled with ghc.
For the less conservative users there are better performing alternatives
in Hackage.

Also, I think your statement about hash tables being the *only* way to
get a performant dictionary is wrong. Tries should offer comparable
(often better) performance and have some nice properties that hash
tables don't give you.

IMO by far the biggest problem with Haskell is that, for all its
sophistication, it's still very difficult to use it to write *robust*
programs (as opposed to programs that are merely "correct" in an
academic sense). But often you can live with this, if all you're
using it for is to knock up quick and dirty "run once" programs
to get whatever answer you're looking for. As long as the program
terminates eventually (as it usually does), who cares about all those
space leaks or the sucky performance?

I suspect that most Haskell users use the language this way (as a kind
of super calculator or for private "in house" stuff). I do anyway. As
you have observed before, what might be regarded as "serious"
publicly available applications written in Haskell seem to be pretty
rare :-( (ghc,pugs,darcs,xmonad are about all I can think of off the top
of my head).

The second biggest problem with Haskell is..something I dare not mention
again (it's taboo in the Haskell community).

But I think Haskellers can take heart from the knowledge that you can
always be certain you've achieved something significant when people take
the time to bash you. It's better than the fate that awaits most
experimental or DIY language implementors (being completely ignored).

Regards
--
Adrian Hey
Larry Coleman
2009-04-16 13:31:47 UTC
Permalink
Post by Adrian Hey
Hello Jon,
Post by Jon Harrop
Post by namekuseijin
I thought one of the ideas behind
Haskell and GHC was to try to achieve performance comparable to
imperative programs while remaining purely functional.
Another part of the failed experiment that was Haskell.
Haskell is not a failed experiment! Even if all that's been achieved
is a demonstration of how not to design a "purely functional"
programming language, it's still a successful experiment :-)
Whilst I would agree (after 10 years or so of using the language) that
Haskell has many serious problems, I think you're missing the point
somewhat in your criticism. Sucky hash table libs, or map libs, or even
sucky performance generally are the least of these problems IMO (and
for the most part easily fixable).
I think the hash table implementation is known to suck, but nobody cares
because nobody wants to use it anyway.
Dr. Harrop has been around the functional language community long
enough to know that already.
Post by Adrian Hey
I suspect that most Haskell users use the language this way (as a kind
of super calculator or for private "in house" stuff). I do anyway. As
you have observed before, what might be regarded as "serious"
publicly available applications written in Haskell seem to be pretty
rare :-( (ghc,pugs,darcs,xmonad are about all I can think of off the top
of my head).
Don't forget that private "in house" stuff can be very useful to the
people who created and use it. Just because you can't see it doesn't
mean it isn't there.
Post by Adrian Hey
The second biggest problem with Haskell is..something I dare not mention
again (it's taboo in the Haskell community).
Don't worry, the vast majority of the Haskell community isn't here;
they're on fa.haskell. (And BTW, I'm positive that Dr. Harrop knows
this, which is why the thread is here instead of there.) You may speak
freely.

Larry
Larry Coleman
2009-04-16 18:09:10 UTC
Permalink
Post by Larry Coleman
Post by Adrian Hey
Whilst I would aggree (after 10 years or so of using the language) that
Haskell has many serious problems, I think you're missing the point
somewhat in your critisism. Sucky hash table libs, or map libs, or even
sucky performance generally are the least of these problems IMO (and
for the most part easily fixable).
I think the hash table implementation is known to suck, but nobody cares
because nobody wants to use it anyway.
Dr. Harrop has been around the functional language community long
enough to know that already.
It turns out that Dr. Harrop was explicitly told that the hash table
implementation was not in common use (this from a comment on his blog):
Post by Larry Coleman
No one uses hashtables in haskell (they're stateful by nature, and most
people aren't willing to stick their otherwise-pure code into IO to use a
hashtable), so I believe their code hasn't been touched in years, and
is thus quite slow.
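
For anyone following along: the table in question is presumably the
Data.HashTable module that shipped with GHC, which lives entirely in IO.
As I recall that API, minimal usage looks something like this (a sketch
from memory, not checked against any particular GHC release):

import qualified Data.HashTable as H

main :: IO ()
main = do
  ht <- H.new (==) H.hashString      -- equality test and hash function
  H.insert ht "key" (42 :: Int)
  v  <- H.lookup ht "key"            -- :: IO (Maybe Int)
  print v

Everything happens in IO, which is exactly the objection the comment
above is making.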
Jon Harrop
2009-04-18 19:09:41 UTC
Permalink
Post by Larry Coleman
It turns out that Dr. Harrop was explicitly told that the hash table
implementation was not in common use (this from a comment on his blog):
Post by Larry Coleman
No one uses hashtables in haskell (they're stateful by nature, and most
people aren't willing to stick their otherwise-pure code into IO to use a
hashtable), so I believe their code hasn't been touched in years, and
is thus quite slow.
That is a circular argument and a bad excuse for not having fixed this
problem, which has been complained about for many years.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Larry Coleman
2009-04-20 12:45:26 UTC
Permalink
Post by Jon Harrop
Post by Larry Coleman
It turns out that Dr. Harrop was explicitly told that the hash table
implementation was not in common use (this from a comment on his blog):
Post by Larry Coleman
No one uses hashtables in haskell (they're stateful by nature, and most
people aren't willing to stick their otherwise-pure code into IO to use a
hashtable), so I believe their code hasn't been touched in years, and
is thus quite slow.
That is a circular argument and a bad excuse for not having fixed this
problem, which has been complained about for many years.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.http://www.ffconsultancy.com/?u
We seem to have differing definitions of the term "circular argument."
AFAIK it is an argument where the conclusion is used as part of one of
the premises. OTOH the statement you're objecting to is not an
argument at all, but a generalization based on reported experience.
Adrian Hey
2009-04-17 13:08:28 UTC
Permalink
Hello Larry,
Post by Larry Coleman
Post by Adrian Hey
I suspect that most Haskell users use the language this way (as a kind
of super calculator or for private "in house" stuff). I do anyway. As
you have observed before, what might be regarded as "serious"
publicly available applications written in Haskell seem to be pretty
rare :-( (ghc,pugs,darcs,xmonad are about all I can think of off the top
of my head).
Don't forget that private "in house" stuff can be very useful to the
people who created and use it. Just because you can't see it doesn't
mean it isn't there.
That was my point actually. Haskell has its problems but it's still
useful. Do I really care if my Haskell prog is 10 times slower than the
same prog written in C? Usually not, if I can write it in 1/10th of the
time.

But then again, if I was writing something where program failure would
result in hugely expensive product recalls, lawsuits or general death
and mayhem, I'd probably not choose Haskell (or C for that matter).
Post by Larry Coleman
Post by Adrian Hey
The second biggest problem with Haskell is..something I dare not mention
again (it's taboo in the Haskell community).
Don't worry, the vast majority of the Haskell community isn't here;
they're on fa.haskell. (And BTW, I'm positive that Dr. Harrop knows
this, which is why the thread is here instead of there.) You may speak
freely.
OK, I'm in the mood for some mischief making. I was talking about
this problem..

http://www.haskell.org/haskellwiki/Top_level_mutable_state

.. this proposed solution ..

http://www.haskell.org/pipermail/haskell-cafe/2004-November/007664.html

.. which has even been implemented in jhc ..

http://repetae.net/computer/jhc/manual.html#top-level-actions

and the embarrassing fact that the world's finest imperative
programming language could never be used to implement most of its own
standard or non-standard IO infrastructure (well none of it that I can
think of).

At least that's in its "pure" form. If we allow unsafePerformIO to be
used in an *unsafe* way and hope the compiler doesn't muck things up,
then it is possible.

Mentioning this is always a good way to start a flame war. The result
is always that the moral majority (who I suspect have never tried
implementing any of this stuff) seem determined to deny the minority
(that have) the right to implement safe IO libs.

:-)

Regards
--
Adrian Hey
Jon Harrop
2009-04-18 19:11:20 UTC
Permalink
Post by Adrian Hey
Post by Larry Coleman
Don't forget that private "in house" stuff can be very useful to the
people who created and use it. Just because you can't see it doesn't
mean it isn't there.
That was my point actually. Haskell has its problems but it's still
useful. Do I really care if my Haskell prog is 10 times slower than the
same prog written in C? Usually not, if I can write it in 1/10th of the
time.
I don't doubt that but what about Haskell vs OCaml, F#, Scala, Clojure or
Erlang?
Post by Adrian Hey
Post by Larry Coleman
Don't worry, the vast majority of the Haskell community isn't here;
they're on fa.haskell. (And BTW, I'm positive that Dr. Harrop knows
this, which is why the thread is here instead of there.) You may speak
freely.
OK, I'm in the mood for some mischief making. I was talking about
this problem..
http://www.haskell.org/haskellwiki/Top_level_mutable_state
.. this proposed solution ..
http://www.haskell.org/pipermail/haskell-cafe/2004-November/007664.html
.. which has even been implemented in jhc ..
http://repetae.net/computer/jhc/manual.html#top-level-actions
and the embarrassing fact that the world's finest imperative
programming language could never be used to implement most of its own
standard or non-standard IO infrastructure (well none of it that I can
think of).
At least that's in its "pure" form. If we allow unsafePerformIO to be
used in an *unsafe* way and hope the compiler doesn't muck things up,
then it is possible.
Mentioning this is always a good way to start a flame war. The result
is always that the moral majority (who I suspect have never tried
implementing any of this stuff) seem determined to deny the minority
(that have) the right to implement safe IO libs.
:-)
Ouch.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Adrian Hey
2009-04-20 12:08:43 UTC
Permalink
Hello Jon,
Post by Jon Harrop
Post by Adrian Hey
That was my point actually. Haskell has its problems but it's still
useful. Do I really care if my Haskell prog is 10 times slower than the
same prog written in C? Usually not, if I can write it in 1/10th of the
time.
I don't doubt that but what about Haskell vs OCaml, F#, Scala, Clojure or
Erlang?
I think all these languages have one problem in common. That is, I don't
know very much about them :-)

Regards
--
Adrian Hey
Larry Coleman
2009-04-20 12:51:49 UTC
Permalink
Post by Jon Harrop
Post by Adrian Hey
Post by Larry Coleman
Don't forget that private "in house" stuff can be very useful to the
people who created and use it. Just because you can't see it doesn't
mean it isn't there.
That was my point actually. Haskell has its problems but it's still
useful. Do I really care if my Haskell prog is 10 times slower than the
same prog written in C? Usually not, if I can write it in 1/10th of the
time.
I don't doubt that but what about Haskell vs OCaml, F#, Scala, Clojure or
Erlang?
Well, since you ask, I recently wrote a SQL parser to use at work. My
first try at it was using F#, but parsing SQL using lex and yacc was
just too painful to think about, so I started on a monadic parser, and
switched to Haskell halfway through after I realized that I was re-
implementing Parsec on the cheap.
Jon Harrop
2009-04-20 18:41:40 UTC
Permalink
Post by Larry Coleman
Post by Jon Harrop
I don't doubt that but what about Haskell vs OCaml, F#, Scala, Clojure or
Erlang?
Well, since you ask, I recently wrote a SQL parser to use at work. My
first try at it was using F#, but parsing SQL using lex and yacc was
just too painful to think about, so I started on a monadic parser, and
switched to Haskell halfway through after I realized that I was re-
implementing Parsec on the cheap.
The F# tools and libraries for parsing currently suck. However, if you're
willing to endure parser combinators, why did you not try FParsec (the F#
translation of Haskell's Parsec)?

Moreover, OCaml has far better tools and libraries for parsing than either
F# or Haskell (IMHO).
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Larry Coleman
2009-04-20 20:23:30 UTC
Permalink
Post by Jon Harrop
Post by Larry Coleman
Post by Jon Harrop
I don't doubt that but what about Haskell vs OCaml, F#, Scala, Clojure or
Erlang?
Well, since you ask, I recently wrote a SQL parser to use at work. My
first try at it was using F#, but parsing SQL using lex and yacc was
just too painful to think about, so I started on a monadic parser, and
switched to Haskell halfway through after I realized that I was re-
implementing Parsec on the cheap.
The F# tools and libraries for parsing currently suck. However, if you're
willing to endure parser combinators, why did you not try FParsec (the F#
translation of Haskell's Parsec)?
Endure parser combinators? You say that like it's a bad thing! :-)

I did look at FParsec briefly, and I don't remember why I didn't use
it. That's not the only thing I didn't like about F#, however. This is
probably just a matter of personal preference, but Haskell strings are
also lists, which means that you can map and do other list-type things
out of the box. OTOH, I don't very often do random access on strings,
so having them be arrays is kind of a drawback. Again, this is just
personal preference on my part.
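
A tiny illustration of the point about Strings being ordinary lists
(example mine):

import Data.Char (toUpper)

keywords :: String -> [String]
keywords = words . map toUpper   -- map, words, filter etc. all apply,
                                 -- because String is just [Char]

-- ghci> keywords "select x from t"
-- ["SELECT","X","FROM","T"]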

Haskell has type classes. I know Ocaml and F# have an equivalent, but
in Haskell you don't need to make a module to do it.
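For example, a class and its instances can be declared wherever it is
convenient, with no separate module or signature (illustrative names):

class Pretty a where
  pretty :: a -> String

instance Pretty Bool where
  pretty True  = "yes"
  pretty False = "no"

instance Pretty a => Pretty [a] where
  pretty = unwords . map pretty

-- ghci> pretty [True, False]
-- "yes no"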

Also, is FParsec one of the tools that currently suck? If so, why on
Earth should I try it?
Post by Jon Harrop
Moreover, OCaml has far better tools and libraries for parsing than either
F# or Haskell (IMHO).
Excuse my ignorance, but I thought that Ocaml source would generally
work in F#, so what tools/libraries exist that couldn't be ported?
Jon Harrop
2009-04-20 21:09:23 UTC
Permalink
Post by Larry Coleman
Post by Jon Harrop
Post by Larry Coleman
Post by Jon Harrop
I don't doubt that but what about Haskell vs OCaml, F#, Scala, Clojure
or Erlang?
Well, since you ask, I recently wrote a SQL parser to use at work. My
first try at it was using F#, but parsing SQL using lex and yacc was
just too painful to think about, so I started on a monadic parser, and
switched to Haskell halfway through after I realized that I was re-
implementing Parsec on the cheap.
The F# tools and libraries for parsing currently suck. However, if you're
willing to endure parser combinators, why did you not try FParsec (the F#
translation of Haskell's Parsec)?
Endure parser combinators? You say that like it's a bad thing! :-)
Parser combinators are fine for some things but I very rarely choose them
for parsing strings because OCaml offers so many more high-level and
performant alternatives with better error checking.
Post by Larry Coleman
I did look at FParsec briefly, and I don't remember why I didn't use
it. That's not the only thing I didn't like about F#, however. This is
probably just a matter of personal preference, but Haskell strings are
also lists, which means that you can map and do other list-type things
out of the box. OTOH, I don't very often do random access on strings,
so having them be arrays is kind of a drawback. Again, this is just
personal preference on my part.
Hmm, strings are enumerables in F# so you can use the Seq module to work on
them sequentially like a list. However, you cannot deconstruct them because
they are internally mutable (argh, run away!).

Having said that, I just memory map the entire string and pass around an
index into it as an immutable int. That is both extremely fast and trivial
to backtrack. I've also been playing with parallelized parsers...
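
The point above is about F#, but the same shape is easy to sketch in
Haskell terms: the whole input plus an immutable offset, so backtracking
is just reusing the old offset (names mine, purely illustrative):

import qualified Data.ByteString.Char8 as B

newtype P a = P { runP :: B.ByteString -> Int -> Maybe (a, Int) }

char :: Char -> P Char
char c = P $ \s i ->
  if i < B.length s && B.index s i == c then Just (c, i + 1) else Nothing

orElse :: P a -> P a -> P a
orElse (P p) (P q) = P $ \s i -> case p s i of
  Nothing -> q s i            -- backtrack: retry from the same offset
  ok      -> ok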
Post by Larry Coleman
Haskell has type classes. I know Ocaml and F# have an equivalent, but
in Haskell you don't need to make a module to do it.
Kind of. The sets of problems that type classes and functors solve well are
(IMHO) non-overlapping though. Type classes are great for things like
operator overloading but they are really abused for many more sophisticated
things in Haskell because they don't have a module system. Functors can be
used in ML to parameterize over arithmetic types and so forth, but that is
so incomprehensible that it is useless; in contrast, they solve the
sophisticated problems (like abstracting over "equivalent" data structures)
really elegantly. Okasaki's book on purely functional data structures is an
excellent example of this. For example, you can construct your catenable
list using any kind of queue, like an amortized-time queue for the best
average-case performance or a "real-time" queue for the best worst-case
performance and so on.
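
To make the Okasaki example concrete, here is the same idea rendered as a
Haskell type class rather than an ML functor, purely for illustration (the
names are mine):

-- The queue interface; any implementation (amortised, real-time, ...) fits.
class Queue q where
  emptyQ  :: q a
  snocQ   :: q a -> a -> q a
  unconsQ :: q a -> Maybe (a, q a)

-- A naive list-backed queue standing in for a "serious" one.
newtype ListQ a = ListQ [a]

instance Queue ListQ where
  emptyQ                 = ListQ []
  snocQ (ListQ xs) x     = ListQ (xs ++ [x])
  unconsQ (ListQ [])     = Nothing
  unconsQ (ListQ (x:xs)) = Just (x, ListQ xs)

-- A catenable list parameterised by the queue used for its children.
data Cat q a = E | C a (q (Cat q a))

append :: Queue q => Cat q a -> Cat q a -> Cat q a
append E       ys = ys
append xs      E  = xs
append (C x q) ys = C x (snocQ q ys)

Swapping in a different Queue instance changes the performance profile
without touching the catenable-list code, which is the kind of abstraction
the functor (or, here, the class) buys you.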

F# has fallen very much on the Haskell side of the fence in this respect: no
module system at all, not even hierarchical module signatures for
encapsulation (!), but a limited form of type classes for overloaded
arithmetic. The cost of abstraction is predictable in F# but,
unfortunately, it is predictably bad because it falls down to virtual
dispatch tables instead of the nice static solution offered by a module
system.

I'm still not convinced that operator overloading is needed at all to be
honest. It completely undermines composability in F# because it requires
type annotations.

I'm also surprised by how little money F# is pulling in for us: less than
OCaml now...
Post by Larry Coleman
Also, is FParsec one of the tools that currently suck? If so, why on
Earth should I try it?
Because you're willing to endure Parsec which sucks just as much. :-)
Post by Larry Coleman
Post by Jon Harrop
Moreover, OCaml has far better tools and libraries for parsing than
either F# or Haskell (IMHO).
Excuse my ignorance, but I thought that Ocaml source would generally
work in F#, so what tools/libraries exist that couldn't be ported?
Haha. No. :-)

The F# ecosystem is nowhere near the maturity of OCaml and, consequently,
has almost no tools and libraries to speak of. Moreover, as a closed world
it can only evolve as quickly as Microsoft can make it, which is
(surprisingly) far slower than OCaml because they have very limited
resources.

Tangentially, I was trying to ossify this theory recently. If Microsoft have
10,000 optimally-directed developers working towards a well-defined goal,
how many open source developers does it take to achieve the same goal in
the same time accidentally via a random walk?

Anyway: In OCaml, you have ocamllex, ocamlyacc, camlp4, dypgen, menhir and a
dozen other rock-solid and ridiculously-fast parser generators that have
been used in industry for the best part of a decade. In F#, you have fslex
which is ocamllex without named sub-regexps (a vitally important feature!)
and without any IDE support, and FParsec and nothing else. Moreover, they
are both many times slower than OCaml.

Maybe this will improve in the future but Microsoft are taking steps to
prevent third parties from improving F# and they disagree with most of my
beliefs about what should be done.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Erik de Castro Lopo
2009-04-21 00:11:06 UTC
Permalink
Post by Jon Harrop
Moreover, OCaml has far better tools and libraries for parsing than either
F# or Haskell (IMHO).
I have tried ocamllex/ocamlyacc, dypgen, menhir and aurouchs. I tried these
when I had already been using Ocaml for a number of years. Ocaml was my
language of choice.

I then tried Parsec with Haskell, a language I had not previously used and
found Parsec/Haskell at least an order of magnitude easier to get things
done in than any of the things I had used with Ocaml.

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
Jon Harrop
2009-04-21 02:13:18 UTC
Permalink
Post by Erik de Castro Lopo
Post by Jon Harrop
Moreover, OCaml has far better tools and libraries for parsing than
either F# or Haskell (IMHO).
I have tried ocamllex/ocamlyacc, dypgen, menhir and aurouchs. I tried
these when I had already been using Ocaml for a number of years. Ocaml was
my language of choice.
I then tried Parsec with Haskell, a language I had not previously used and
found Parsec/Haskell at least an order of magnitude easier to get things
done in than any of the things I had used with Ocaml.
Wow. That's really interesting.

What kinds of things were you parsing?
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Erik de Castro Lopo
2009-04-21 02:40:18 UTC
Permalink
Post by Jon Harrop
Wow. That's really interesting.
What kinds of things were you parsing?
ActionScript, which is a JavaScript-like language. It was difficult
because ActionScript includes inline regexes with a syntax much
like Perl's

var re = /test-\d/i;

and inline XML so you can do:

var x = <tag1>
<tag2 attrib="X">data</tag2>
<tag2 attrib="Y">something else</tag2>
</tag1> ;

See:

http://livedocs.adobe.com/flash/9.0/ActionScriptLangRefV3/RegExp.html
http://livedocs.adobe.com/flash/9.0/ActionScriptLangRefV3/XML.html

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
Ertugrul Söylemez
2009-04-20 10:47:11 UTC
Permalink
Post by Larry Coleman
Don't worry, the vast majority of the Haskell community isn't here;
they're on fa.haskell. (And BTW, I'm positive that Dr. Harrop knows
this, which is why the thread is here instead of there.) You may
speak freely.
OK, I'm in the mood for some mischief making. I was talking about this
problem..
http://www.haskell.org/haskellwiki/Top_level_mutable_state
.. this proposed solution ..
http://www.haskell.org/pipermail/haskell-cafe/2004-November/007664.html
[...]
What's wrong with mutable state?

data AppCfg = AppCfg {
      someString :: String,
      someInt    :: Int }

type AppIO = StateT AppCfg IO

Now AppIO is IO with implicit state. You can initialize it from IO and
write the rest of your program in AppIO:

getAppCfg :: IO AppCfg
getAppCfg = do
    r <- randomIO
    return AppCfg { someString = "blah", someInt = r }

main :: IO ()
main = getAppCfg >>= evalStateT myApp

myApp :: AppIO ()
myApp = ...

In the context of AppIO, this is global state. You can use it through
the usual state manipulation functions 'get' and 'put'. The auxiliary
functions 'gets' and 'modify' are very useful here. If your state is
not mutable, use ReaderT instead of StateT, which may improve
performance a bit.
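
A sketch of that ReaderT variant, reusing the AppCfg and getAppCfg from
above (example mine):

import Control.Monad.Reader

type AppR = ReaderT AppCfg IO

myAppR :: AppR ()
myAppR = do
    s <- asks someString          -- read-only access to the shared config
    liftIO (putStrLn s)

mainR :: IO ()
mainR = getAppCfg >>= runReaderT myAppR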

Of course, you need to know Haskell and its base library to some extent
to come up with such solutions. Most people arguing against the
language don't have a clue.

I'm using Haskell for quite some time now and I'm happy with it. I know
it's bashed a lot, but usually I don't really care, since I know its
value and most people bashing it don't have written a single line of
Haskell code.

Its performance is very good compared to most other functional
languages. Compared to a well written C implementation of an algorithm,
you can expect about 50%-80% of that performance, when implemented in
Haskell. The code is short and concise. The only real problem is lack
of libraries. Also some general convention for OOP would be useful,
maybe together with a number of language constructs, even though they
aren't strictly necessary.

Why prefer Haskell over OCaml or F#? That's personal taste, probably.
I prefer Haskell's syntax and I like its purity. It has some unique
features, which OCaml and F# don't have. For example to know how a
function is used, in almost all cases it suffices to look at its type.
To know how to solve a problem, constructing its type usually gives the
solution right away. That makes incorrect Haskell code fail to compile
in many cases, which helps avoid run-time bugs.

There are a few people who make some valid points, like Dr. Harrop.
However, even though some of his points are valid, his conclusions are
entirely wrong. He concludes that Haskell is totally useless for real
applications.


Greets,
Ertugrul.
--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/
Adrian Hey
2009-04-20 16:28:32 UTC
Permalink
Hello Ertugrul
Post by Ertugrul Söylemez
data AppCfg = AppCfg {
      someString :: String,
      someInt    :: Int }
type AppIO = StateT AppCfg IO
Now AppIO is IO with implicit state. You can initialize it from IO and
getAppCfg :: IO AppCfg
getAppCfg = do
    r <- randomIO
    return AppCfg { someString = "blah", someInt = r }
main :: IO ()
main = getAppCfg >>= evalStateT myApp
myApp :: AppIO ()
myApp = ...
In the context of AppIO, this is global state. You can use it through
the usual state manipulation functions 'get' and 'put'. The auxiliary
functions 'gets' and 'modify' are very useful here. If your state is
not mutable, use ReaderT instead of StateT, which may improve
performance a bit.
Try using this to eliminate all 23 uses of the "unsafePerformIO hack"
you can find in the ghc distributed libs..

http://www.haskell.org/pipermail/haskell-cafe/2008-September/046940.html

The only reason "global variables" (so called) are not even more
apparent in the Haskell source code than they already are is that they
are themselves heavily dependent on C or OS provided infrastructure.

What if all these libs were implemented in Haskell, all the way down to
hardware?

What would the resulting library APIs look like?

What would main look like for non trivial program that one way or
another was dependent on all these libs + a few others?

There are so many things wrong with this kind of approach it's hard to
know where to start really, other than to say I think it's totally
lacking in modularity, maintainability and possibly also safety too.

In contrast there is nothing inherently wrong or unsafe about allowing
(monomorphic) IORefs, MVars, Chans etc in top level expressions, other
than the ridiculous dogma about "global variables" (which is not
something unique to the Haskell community I might add).

I say they are not "global variables" (a meaningless term if ever there
was). What they are is a modular way of obtaining static guarantees
of identity correctness (1 per process). Considering getting static
correctness guarantees is a perennial obsession of the Haskell community
it seems strange that so many folk want to dismiss this as being
inherently "evil".

But as I hinted at before, I think part of the problem is that most in
the community are IO lib consumers, not producers. Maybe if they spent
less time delivering lectures about the evils of global variables and
more time implementing this stuff themselves they would have a better
understanding of the problem.

All that said, if you look at the Xmonad source code you can see that
the authors have done precisely what you suggest. It's just about doable
if you are writing the top level application (I.E. you are the person in
charge of main), but highly unmodular IMO as it means all the program
state has to be gathered up into 1 centralised "here is my entire
program state" module. (And of course in the case of Xmonad the story
is incomplete, given its dependence on other libs with access to
hidden top level state.)

But if you're writing an IO *library*, you have no control over or
access to the applications main. There is only 1 reasonable monad to
use (IO) and somehow from *within* that library you have to prevent
users from doing anything that's going to cause the world to explode,
no matter how hard they try.
Post by Ertugrul Söylemez
Of course, you need to know Haskell and its base library to some extent
to come up with such solutions. Most people arguing against the
language don't have a clue.
Does he mean *me* I wonder? :-)

Well fortunately for the future of Haskell there is one implementor who
does "get it" (John Meacham of course) and I would be inclined to
suspect that he does have a clue :-)

Unfortunately for the future of Haskell, jhc is not the poster child for
Haskell', ghc is :-(

Regards
--
Adrian Hey
Ertugrul Söylemez
2009-04-21 21:45:06 UTC
Permalink
Post by Adrian Hey
Post by Ertugrul Söylemez
data AppCfg = AppCfg {
      someString :: String,
      someInt    :: Int }
type AppIO = StateT AppCfg IO
Now AppIO is IO with implicit state. You can initialize it from IO
getAppCfg :: IO AppCfg
getAppCfg = do
    r <- randomIO
    return AppCfg { someString = "blah", someInt = r }
main :: IO ()
main = getAppCfg >>= evalStateT myApp
myApp :: AppIO ()
myApp = ...
In the context of AppIO, this is global state. You can use it
through the usual state manipulation functions 'get' and 'put'. The
auxiliary functions 'gets' and 'modify' are very useful here. If
your state is not mutable, use ReaderT instead of StateT, which may
improve performance a bit.
Try using this to eliminate all 23 uses of the "unsafePerformIO hack"
you can find in the ghc distributed libs..
http://www.haskell.org/pipermail/haskell-cafe/2008-September/046940.html
Perhaps it's just me, but I've never used the unsafePerformIO hack.
Post by Adrian Hey
The only reason "global variables" (so called) are not even more
apparent in the Haskell source code than they already are is that they
are themselves heavily dependent on C or OS provided infrastructure.
What if all these libs were implemented in Haskell, all the way down
to hardware?
What would the resulting library APIs look like?
What would main look like for non trivial program that one way or
another was dependent on all these libs + a few others?
It has been shown that a complete operating system can be implemented
entirely in Haskell. Have a look at House [1].

[1] http://programatica.cs.pdx.edu/House/
Post by Adrian Hey
There are so many things wrong with this kind of approach it's hard to
know where to start really, other than to say I think it's totally
lacking in modularity, maintainability and possibly also safety too.
It's lacking libraries and I wish to have real OOP in Haskell without
needing a DSL or something. That's it. Both the language and the
runtime system (at least of GHC) are very safe for most senses of
safety.

Of course as well I might be misunderstanding your idea of modularity.
It has a mature module system, its own package management (Cabal), etc.
Post by Adrian Hey
In contrast there is nothing inherently wrong or unsafe about allowing
(monomorphic) IORefs, MVars, Chans etc in top level expressions, other
than the ridiculous dogma about "global variables" (which is not
something unique to the Haskell community I might add).
I say they are not "global variables" (a meaningless term if ever
there was). What they are is a modular way of obtaining static
guarantees of identity correctness (1 per process). Considering
getting static correctness guarantees is a perennial obsession of the
Haskell community it seems strange that so many folk want to dismiss
this as being inherently "evil".
There is something wrong with global variables in the Haskell paradigm,
because since everything is pure, a global modifiable variable isn't
even expressible. If you really need them, either you use StateT, which
is perfectly safe, pure and modular (remember that multiple StateTs can
be combined) or you use the ugly variant of essentially the same, an
unsafePerformIO hack.
Post by Adrian Hey
[...]
All that said, if you look at the Xmonad source code you can see that
the authors have done precisely what you suggest. It's just about
doable if you are writing the top level application (I.E. you are the
person in charge of main), but highly unmodular IMO as it means all
the program state has to be gathered up into 1 centralised "here is my
entire program state" module. (And of course in the case of Xmonad the
story is incomplete, given its dependence on other libs with access
to hidden top level state.)
As noted before, StateTs can be combined. Combining monads is precisely
the point of monad transformers. What I've done in my example above is
combining State and IO. You can combine State, State, State, Maybe,
Parser and IO if you wish.
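A minimal sketch of such a stack, just to show that the layers compose
(types and names mine):

import Control.Monad.State

data NetSt  = NetSt  { counter  :: Int }
data UserSt = UserSt { userName :: String }

type App = StateT NetSt (StateT UserSt IO)

bump :: App ()
bump = modify (\s -> s { counter = counter s + 1 })   -- outer layer

whoami :: App String
whoami = lift (gets userName)                         -- inner layer

runApp :: App a -> IO a
runApp app = evalStateT (evalStateT app (NetSt 0)) (UserSt "guest")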
Post by Adrian Hey
But if you're writing an IO *library*, you have no control over or
access to the applications main. There is only 1 reasonable monad to
use (IO) and somehow from *within* that library you have to prevent
users from doing anything that's going to cause the world to explode,
no matter how hard they try.
For some reason, people refrain from using StateT here, but prefer to
use unsafePerformIO. I think one reason is that it makes the library
appear easier to use. The CGI people have done it properly. The CGI
monad is essentially IO together with State.
Post by Adrian Hey
Post by Ertugrul Söylemez
Of course, you need to know Haskell and its base library to some
extent to come up with such solutions. Most people arguing against
the language don't have a clue.
Does he mean *me* I wonder? :-)
No. =)

I didn't mean anyone specific.
Post by Adrian Hey
Well fortunately for the future of Haskell there is one implementor
who does "get it" (John Meacham of course) and I would be inclined to
suspect that he does have a clue :-)
Unfortunately for the future of Haskell, jhc is not the poster child
for Haskell', ghc is :-(
Honestly I've never tried JHC. I'll give it a shot, when I've got some
time. =)


Greets,
Ertugrul.
--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/
Adrian Hey
2009-04-22 09:11:11 UTC
Permalink
Hello Ertugrul,
Post by Ertugrul Söylemez
Perhaps it's just me, but I've never used the unsafePerformIO hack.
Have you ever written an IO lib? A non-trivial FFI binding perhaps?
I'm not saying that even if you have you would necessarily have needed
to use the unsafePerformIO hack. Sometimes you don't need it, but
often use of a top level MVar or something (to implement a lock
say) really is the *only* semantically correct and safe solution,
unless you're going to just say the lib is not thread safe and
should not be used in concurrent apps (a pretty severe limitation
IMO). Unfortunately Haskell provides no safe way to do this.

Even some of those that have managed to avoid it only do so by
presenting an inherently unsafe API (e.g. SDL)

BTW, anyone who has ever used stdout has used a "global variable"
created using the unsafePerformIO hack. Or at least ghc users, take a
look at the GHC.Handle source. Funny thing is, hardly anyone even
recognises stdout,stdin,stderr as "global variables". Or they do, but
for some mysterious reason (never explained) these and *only* these are
not "evil".
Post by Ertugrul Söylemez
It has been shown that a complete operating system can be implemented
entirely in Haskell. Have a look at House [1].
[1] http://programatica.cs.pdx.edu/House/
You mean the "Principled Approach to Operating System Construction in
Haskell"? :-)

Again, perhaps you should take a look at the source. Even for this
modest OS I can count at least 8 uses of the "unsafePerformH hack" to
create "global variables". There would be even more if I included the
cbits of this entirely written in Haskell OS. Not that I would ever be
one to suggest that this is in any way unprincipled.

:-)

Some conclusions re. the suitability of Haskell in its current form
for such projects can be found in these papers:

http://web.cecs.pdx.edu/~rebekah/papers/icfp05_h.pdf
http://web.cecs.pdx.edu/~rebekah/papers/plos07.pdf

No real surprises there, but strangely neither of these papers mentions
the dependence on "global variables" nor the reliance on an unsound hack
to create them. I would have thought a minor problem like not even being
able to rely on correct compilation might have been worth a mention in
passing :-)

It all adds to my suspicion that there is a real conspiracy of silence
about this problem in the research community, like this is the problem
that nobody wants (dares) to mention or admit exists. The silence from
ghchq on this issue is also "deafening" (as they say). I really can't
imagine why that might be.
Post by Ertugrul Söylemez
Of course as well I might be misunderstanding your idea of modularity.
I mean that without top level state (a lock say) I have to provide a
library API that is not only more complex and less
convenient to use and maintain, it is also less safe.

For the lock example, instead of this..

module LockDemo (doSomething) where

lock :: MVar ()
lock <- newMVar () -- Hypothetical top level <- binding

doSomething :: IO Whatever
doSomething = do ...
                 takeMVar lock
                 <call some dodgy C>
                 <call some more dodgy C>
                 putMVar lock ()
                 ...

...I have to use something like this..

module LockDemo (Lock,newLock,doSomething) where

newtype Lock = Lock (MVar ())

-- | If you make more than one of these you're screwed
newLock :: IO Lock
newLock = do mvar <- newMVar ()
             return (Lock mvar)

doSomething :: Lock -> IO Whatever
doSomething (Lock lock) = do ...
                             takeMVar lock
                             <call some dodgy C>
                             <call some more dodgy C>
                             putMVar lock ()
                             ...

I now have no static guarantee that only one lock will be created.
I have no alternative but to "beg" users for safe use via a Haddock
comment. As any libs that may interface to me cannot assume they
have a monopoly on my services they must in turn "pass the buck" of
uniqueness responsibility on in their own APIs, and so on..until we
end up at main itself. Not exactly modular IMO.

Furthermore, assuming everybody has respected the uniqueness properties,
I still have to ensure that doSomething gets the correct lock passed. So
if I have several locks, the only safe way to do this is to make them
distinct types.

Other libs might well be modified to make use of my services too at some
point, but they can't do this without changing their own APIs because
they need to take the Lock(s) as arguments. Also, suppose the C became
less dodgy in a later revision. It would be nice to eliminate the locks,
but now I can't do that without changing the API (or at least there's
little point in doing this if I'm not going to simplify the API). So
it's not exactly maintainable either IMO.

And no, I don't think exotic library or application level monads help
solve this problem :-(

And no again (before anybody says it), the problem is not with "badly
designed" C libs. Many of them certainly are, but if you stripped away
all the "badly designed" C libs and "badly designed" OS interfaces
you'd just end up interfacing directly to "badly designed" hardware.

Haskell should be able to safely interface to any "badly designed"
environment at whatever system level, or else depend only on "well
designed" interfaces that could never actually be implemented in Haskell
(not the characteristics of a general purpose programming language).
Post by Ertugrul Söylemez
There is something wrong with global variables in the Haskell paradigm,
because since everything is pure, a global modifiable variable isn't
even expressible. If you really need them, either you use StateT, which
is perfectly safe, pure and modular (remember that multiple StateTs can
be combined) or you use the ugly variant of essentially the same, an
unsafePerformIO hack.
Sorry I don't really know how to answer this, but it looks to me as if
you don't understand the problem people are trying to solve when they
use the unsafePerformIO hack. If so, you would certainly not be the
first to misunderstand this which is why it is so hard to have a
sensible discussion about this problem (people have already made up
their minds about the "right way to do it" before they've even taken
the time to understand the problem). I don't know if my words above have
clarified this at all :-)

Remember the concurrency semantics of MVars, Chans and functions like
atomicModifyIORef? I don't think StateT is any substitute for a genuine
MVar or IORef.
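
The distinction being drawn here is that StateT state is threaded through
one computation, whereas an MVar is a single shared cell with atomic access
across threads; a small sketch (mine) of the latter:

import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
  counter <- newMVar (0 :: Int)
  done    <- newEmptyMVar
  forM_ [1, 2 :: Int] $ \_ -> forkIO $ do
    forM_ [1 .. 1000 :: Int] $ \_ ->
      modifyMVar_ counter (return . (+ 1))   -- atomic w.r.t. other threads
    putMVar done ()
  replicateM_ 2 (takeMVar done)
  readMVar counter >>= print                 -- always 2000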
Post by Ertugrul Söylemez
For some reason, people refrain from using StateT here, but prefer to
use unsafePerformIO.
I think they use the unsafePerformIO hack because it is the only way
to get a unique MVar or whatever at the top level and that is what they
need :-)
Post by Ertugrul Söylemez
Honestly I've never tried JHC. I'll give it a shot, when I've got some
time. =)
Well I don't use it myself either. I think at the moment it's very much
work in progress, so I don't know if it really is realistically useable
(yet).

Regards
--
Adrian Hey
Manlio Perillo
2009-04-22 10:40:53 UTC
Permalink
Post by Adrian Hey
BTW, anyone who has ever used stdout has used a "global variable"
created using the unsafePerformIO hack.
Do you mean that a definition like:

stdoutFD = 0


is a global variable?
Post by Adrian Hey
[...]
Regards Manlio
Adrian Hey
2009-04-22 14:53:46 UTC
Permalink
Hello Manillo,
Post by Manlio Perillo
stdoutFD = 0
is a global variable?
No, it clearly isn't. I mean a definition like this is a global
variable..

stdout :: Handle
stdout = unsafePerformIO $ do
   -- ToDo: acquire lock
   -- We don't set non-blocking mode on standard handles, because it may
   -- confuse other applications attached to the same TTY/pipe
   -- see Note [nonblock]
   (buf, bmode) <- getBuffer fd_stdout WriteBuffer
   mkStdHandle fd_stdout "<stdout>" WriteHandle buf bmode

(Taken from GHC.Handle)

You can tell a Handle is a mutable data structure (and hence that
stdout must be a "global variable") by looking at the Haddock..

http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#t%3AHandle

Regards
--
Adrian Hey
Manlio Perillo
2009-04-23 15:58:15 UTC
Permalink
[...]
No, it clearly isn't. I mean a definition like this is a global variable..
stdout :: Handle
stdout = unsafePerformIO $ do
   -- ToDo: acquire lock
   -- We don't set non-blocking mode on standard handles, because it may
   -- confuse other applications attached to the same TTY/pipe
   -- see Note [nonblock]
   (buf, bmode) <- getBuffer fd_stdout WriteBuffer
   mkStdHandle fd_stdout "<stdout>" WriteHandle buf bmode
(Taken from GHC.Handle)
You can tell a Handle is a mutable data structure (and hence that stdout
must be a "global variable") by looking at the Haddock..
Ah, ok.
By the way, Handle also uses an IORef.



P.S.: after a quick search I found this article
http://www.cs.chalmers.se/~rjmh/Globals.ps



Regards Manlio
Adrian Hey
2009-04-24 08:02:40 UTC
Permalink
Hello Manlio,
Post by Manlio Perillo
P.S.: after a quick search I found this article
http://www.cs.chalmers.se/~rjmh/Globals.ps
Yes, that is a very interesting paper. I've always been a bit sceptical
of that solution too (for reasons I won't bore everyone with) but I did
read it again last night. I didn't change MHO on it, but I see it also
does tend to confirm my suspicions about the scalability of the monad
transformer approach too :-(

I guess for Haskell's IO lib implementors the only realistic solution
for the foreseeable future is to keep on using the unsafePerformIO hack
and just pray that the compiler gods will forgive their evil ways :-)

Regards
--
Adrian Hey
Ertugrul Söylemez
2009-04-22 22:46:52 UTC
Permalink
Post by Ertugrul Söylemez
Perhaps it's just me, but I've never used the unsafePerformIO hack.
Have you ever written an IO lib? [...]
Post by Ertugrul Söylemez
Of course as well I might be misunderstanding your idea of modularity.
I mean that without top level state (a lock say) I have to provide a
library API that is not only more complex and less
convenient to use and maintain, it is also less safe.
[...]
And no, I don't think exotic library or application level monads help
solve this problem :-(
[...]
Post by Ertugrul Söylemez
There is something wrong with global variables in the Haskell
paradigm, because since everything is pure, a global modifiable
variable isn't even expressible. If you really need them, either
you use StateT, which is perfectly safe, pure and modular (remember
that multiple StateTs can be combined) or you use the ugly variant
of essentially the same, an unsafePerformIO hack.
Sorry I don't really know how to answer this, but it looks to me as if
you don't understand the problem people are trying to solve when they
use the unsafePerformIO hack. If so, you would certainly not be the
first to misunderstand this which is why it is so hard to have a
sensible discussion about this problem (people have already made up
their minds about the "right way to do it" before they've even taken
the time to understand the problem). I don't know if my words above
have clarified this at all :-)
Remember the concurrency semantics of MVars, Chans and functions like
atomicModifyIORef? I don't think StateT is any substitute for a
genuine MVar or IORef.
[...]
Post by Ertugrul Söylemez
For some reason, people refrain from using StateT here, but prefer
to use unsafePerformIO.
I think they use the unsafePerformIO hack because it is the only way
to get a unique MVar or whatever at the top level and that is what
they need :-)
I don't think I'm misunderstanding the problem. I rather think, you're
misunderstanding my solution. =)

data MyLibCfg
  = MyLibCfg {
      doSomethingLock :: MVar () }

type MyLibIO = StateT MyLibCfg IO

doSomething :: MyLibIO Whatever
doSomething = do
  lock <- gets doSomethingLock
  liftIO $ do
    takeMVar lock
    ...
    putMVar lock ()

Have a look at the CGI and the MPD libraries for practical examples of
this idea (although I've never looked into their source, the approach
seems to be the same). The only remaining problem is initializing the
lock. This is how I do it:

withMyLib :: StateT MyLibCfg IO a -> IO a
withMyLib comp = do
  dsl <- newMVar ()
  initMyLibInSomeWay
  result <- evalStateT comp MyLibCfg { doSomethingLock = dsl }
  freeMyLibInSomeWay
  return result

Of course there is no (safe) way to make sure that withMyLib is not
called multiple times, but the point is, there doesn't need to be such a
guarantee.

There is one type of library, for which this approach fails: Libraries,
which need to be initialized through an explicit call to a function, but
can only be freed through exiting the process/thread. I'd say these
libraries are so badly designed that I wouldn't use them anyway. If for
some reason I'm forced to, I wouldn't feel as bad about the
unsafePerformIO as about using such a library in the first place. =)
Post by Ertugrul Söylemez
It has been shown that a complete operating system can be
implemented entirely in Haskell. Have a look at House [1].
[1] http://programatica.cs.pdx.edu/House/
You mean the "Principled Approach to Operating System Construction in
Haskell"? :-)
I have actually tried it and it works fine (a bit slow though). =)
Again, perhaps you should take a look at the source. Even for this
modest OS I can count at least 8 uses of the "unsafePerformH hack" to
create "global variables". There would be even more if I included the
cbits of this entirely written in Haskell OS. Not that I would ever be
one to suggest that this is in any way unprincipled.
[...]
The point I've tried to make is that you don't _need_ to do that. In
fact, I don't know why many programmers do it. I'm sure it could be
avoided through better understanding of monads, monad transformers and
the language itself.
Some conclusions re. the suitability of Haskell in its current form
http://web.cecs.pdx.edu/~rebekah/papers/icfp05_h.pdf
http://web.cecs.pdx.edu/~rebekah/papers/plos07.pdf
No real surprises there, but strangely neither of these papers
mentions the dependence on "global variables" nor the reliance on an
unsound hack to create them. I would have thought a minor problem like
not even being able to rely on correct compilation might have been
worth a mention in passing :-)
It all adds to my suspicion that there is a real conspiracy of silence
about this problem in the research community, like this is the problem
that nobody wants (dares) to mention or admit exists. The silence from
ghchq on this issue is also "deafening" (as they say). I really can't
imagine why that might be.
Neither do I know, nor do I care.

I like Haskell, because it's both elegant and useful in practice for me.
If it fails for you, feel free to use something different. =)


Greets,
Ertugrul.
--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/
Adrian Hey
2009-04-23 16:29:21 UTC
Permalink
Hello Ertugrul,
Post by Ertugrul Söylemez
I don't think I'm misunderstanding the problem. I rather think, you're
misunderstanding my solution. =)
data MyLibCfg
  = MyLibCfg {
      doSomethingLock :: MVar () }
type MyLibIO = StateT MyLibCfg IO
doSomething :: MyLibIO Whatever
doSomething = do
  lock <- gets doSomethingLock
  liftIO $ do
    takeMVar lock
    ...
    putMVar lock ()
Have a look at the CGI and the MPD libraries for practical examples of
this idea (although I've never looked into their source, the approach
seems to be the same). The only remaining problem is initializing the
withMyLib :: StateT MyLibCfg IO a -> IO a
withMyLib comp = do
  dsl <- newMVar ()
  initMyLibInSomeWay
  result <- evalStateT comp MyLibCfg { doSomethingLock = dsl }
  freeMyLibInSomeWay
  return result
..or something like that, if that's what turns you on :-)

I just don't know how this approach may (or may not) scale for non-toy
examples (entire OS say) in the presence of concurrency, changing
requirements, changing implementations, changing hardware, changing
sub-system dependency graphs..

But IMO the biggest problem with this is it just doesn't address the
fundamental safety problem (loss of static uniqueness guarantee and
propagation of "uniqueness responsibility" all the way up to main).
Users just shouldn't have such safety responsibilities imposed on them.
The library and associated API should be inherently safe, no matter
what, by design. "Bullet Proof" IOW.

This may be fine if I do want a pseudo global but still parameterisable
environment. But this possibility is precisely what I don't want in this
situation. The reason for making it a "global variable", albeit with
a very dodgy hack, was to prevent this spoofing possibility.
Post by Ertugrul Söylemez
Of course there is no (safe) way to make sure that withMyLib is not
called multiple times, but the point is, there doesn't need to be such a
guarantee.
I don't know how you reached that conclusion. In this particular case
(only) I guess it would be OK if it was run multiple times sequentially.
But as we are talking about (presumably) concurrent apps I think this
statement is not true (there does need to be such a guarantee).
Post by Ertugrul Söylemez
Post by Adrian Hey
Again, perhaps you should take a look at the source. Even for this
modest OS I can count at least 8 uses of the "unsafePerformH hack" to
create "global variables". There would be even more if I included the
cbits of this entirely written in Haskell OS. Not that I would ever be
one to suggest that this is in any way unprincipled.
[...]
The point I've tried to make is that you don't _need_ to do that. In
fact, I don't know why many programmers do it.
I suppose the House folk had their reasons. Something to do with
modularity, maintainability and safety possibly? :-) But I bet they
hated having to use the unsafePerformIO hack to do something that
frickin well should be *perfectly safe*.

This is the real dilemma the Haskell designers have placed us in. What
should be the correct and safest low level IO library or module design
is often (usually in fact) unimplementable without using a very
dangerous hack.
Post by Ertugrul Söylemez
I'm sure it could be
avoided through better understanding of monads, monad transformers and
the language itself.
I don't think anyone doubts "global variables" could be avoided, with
or without the monadic gymnastics. The more relevant question for me is
why *should* they be avoided, given that they provide exactly the
properties that I'm looking for? Nothing else I've seen suggested as
an alternative over the years offers the same convenience and safety
for both implementors *and* users.

Because "global variables are evil they just are and will bite you one
day (tm)"? :-)
Post by Ertugrul Söylemez
I like Haskell, because it's both elegant and useful in practice for me.
If it fails for you, feel free to use something different. =)
Don't worry, I'm already gone (have been for some time in fact). But if
this issue isn't properly addressed in the next standard, one way or
another, then a great many people might discover that Haskell fails for
them one day..
http://hackage.haskell.org/trac/ghc/ticket/1366
:-)

Besides, apart from the practical dangers of incorrect compilation, the
"unsafePerformIO hack" is just an embarrassment for a language like
Haskell.

Regards
--
Adrian Hey
Adrian Hey
2009-04-29 13:37:52 UTC
Permalink
Hello Folks,

Just thought I'd say a bit about what this "unsafePerformIO hack" thing
was all about and what's wrong with it, just in case anyone was
wondering.
Post by Adrian Hey
lock :: MVar ()
lock <- newMVar () -- Hypothetical top level <- binding
..or in real broken (impure) Haskell
lock = unsafePerformIO (newMVar ())

The problem is that it's simply not true that lock is "equal to"
unsafePerformIO (newMVar ()). It's obviously unsafe in the presence of
(at least) two common and reasonable compiler transformations: common
sub-expression elimination and inlining.
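
In practice the hack is usually accompanied by a NOINLINE pragma (and
sometimes -fno-cse) that asks the compiler not to perform exactly those
transformations on the definition, e.g.:

import Control.Concurrent.MVar (MVar, newMVar)
import System.IO.Unsafe (unsafePerformIO)

{-# NOINLINE lock #-}   -- keep GHC from inlining or duplicating this
lock :: MVar ()
lock = unsafePerformIO (newMVar ())

That is convention rather than a guarantee, which is rather the point
being made here.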

Regards
--
Adrian Hey
Jon Harrop
2009-04-22 09:36:56 UTC
Permalink
Post by Ertugrul Söylemez
Post by Adrian Hey
There are so many things wrong with this kind of approach it's hard to
know where to start really, other than to say I think it's totally
lacking in modularity, maintainability and possibly also safety too.
It's lacking libraries and I wish to have real OOP in Haskell without
needing a DSL or something. That's it. Both the language and the
runtime system (at least of GHC) are very safe for most senses of
safety.
There is so little software written in Haskell that it is difficult to study
but Frag and Darcs have both crashed for me. Frag crashed because the poor
performance of idiomatic Haskell forced the programmer to use unsafe hacks.

There are also serious unresolved problems with GHC's internals. Read the
latest paper on their implementation of sparks, for example.
Post by Ertugrul Söylemez
Of course as well I might be misunderstanding your idea of modularity.
It has a mature module system,
Haskell's module system could barely be less advanced.
Post by Ertugrul Söylemez
its own package management (Cabal), etc.
Cabal fails to install roughly half of the packages I have tried to install.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Jon Harrop
2009-04-18 20:01:03 UTC
Permalink
Post by Adrian Hey
Post by Jon Harrop
Post by namekuseijin
I thought one of the ideas behind
Haskell and GHC was to try to achieve performance comparable to
imperative programs while remaining purely functional.
Another part of the failed experiment that was Haskell.
Haskell is not a failed experiment! Even if all that's been achieved
is a demonstration of how not to design a "purely functional"
programming language, it's still a successful experiment :-)
LOL. :-)
Post by Adrian Hey
Whilst I would aggree (after 10 years or so of using the language) that
Haskell has many serious problems, I think you're missing the point
somewhat in your critisism. Sucky hash table libs, or map libs, or even
sucky performance generally are the least of these problems IMO (and
for the most part easily fixable).
I think the hash table implementation is known to suck, but nobody cares
because nobody wants to use it anyway.
That is a circular argument. Nobody uses them in Haskell because they suck
(only) in Haskell. There is nothing objectively bad about hash tables. They
are a great data structure.

I was equally horrified to see Xavier Leroy claim that OCaml users do not
care about getting better GUI libraries because he asked some of the people
who use OCaml despite its lack of decent GUI libraries, i.e. a
self-selected group.

Is it not obvious that decent hash tables and GUIs libs would help?
Post by Adrian Hey
Also, I think your statement about hash tables being the *only* way to
get a performant dictionary is wrong. Tries should offer comparable
(often better) performance and have some nice properties that hash
tables don't give you.
No, tries are a very specialized data structure for keys that are sequences
and you want very specialized operations over them like searching by
Levenshtein distance. Tries offer competitive performance only when the
keys are sufficiently long sequences. In general, a decent hash table is
10x faster even when keys are short sequences, e.g. English words. Tries
are slow because they incur many costly indirections whereas (decent) hash
tables incur only one.

That is also a circular argument though because a trie is just a tree of
dictionaries and you still have to implement those dictionaries and a hash
table can be good there too. For example, an LZW compressor is likely to
use a trie where each sub-dictionary maps bytes to tries. The obvious
sub-dictionary is a 256-element array mapping bytes to tries but that is
another array of boxed values so mutations will also cripple GHC. The
nearest pure alternative is a balanced binary search tree which will
require up to 8 indirections instead of one and, consequently, will be a
lot slower.
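
For concreteness, a minimal sketch (mine, not taken from the benchmark) of
the pure LZW-style trie being described, using a balanced tree as each
node's sub-dictionary:

import qualified Data.Map as M
import Data.Word (Word8)

-- Each node maps the next key byte to a child trie through a Data.Map, so
-- every byte of key costs O(log 256), i.e. up to ~8 extra pointer chases,
-- before the child is even reached. The imperative alternative is a mutable
-- 256-element array of boxed children per node, which runs into the GHC
-- garbage-collector behaviour discussed earlier in the thread.
data ByteTrie v = ByteTrie (Maybe v) (M.Map Word8 (ByteTrie v))

lookupBytes :: [Word8] -> ByteTrie v -> Maybe v
lookupBytes []     (ByteTrie v _)    = v
lookupBytes (b:bs) (ByteTrie _ kids) = M.lookup b kids >>= lookupBytes bs
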
Post by Adrian Hey
IMO by far the biggest problem with Haskell is that, for all its
sophistication, it's still very difficult to use it to write *robust*
programs (as opposed to programs that are merely "correct" in an
academic sense). But often you can live with this, if all you're
using it for is to knock up quick and dirty "run once" programs
to get whatever answer you're looking for. As long as the program
terminates eventually (as it usually does), who cares about all those
space leaks or the sucky performance?
I suspect that most Haskell users use the language this way (as a kind
of super calculator or for private "in house" stuff). I do anyway. As
you have observed before, what might be regarded as "serious"
publicly available applications written in Haskell seem to be pretty
rare :-( (ghc,pugs,darcs,xmonad are about all I can think of off the top
of my head).
I can see why Haskell's weaknesses do not undermine such usage, but I cannot
see anything that makes Haskell preferable to alternatives like OCaml, F#,
Scala and Clojure for it.
Post by Adrian Hey
The second biggest problem with Haskell is... something I dare not mention
again (it's taboo in the Haskell community).
:-)
Post by Adrian Hey
But I think Haskellers can take heart from the knowledge that you can
always be certain you've achieved something significant when people take
the time to bash you. It's better than the fate that awaits most
experimental or DIY language implementors (being completely ignored).
True.

However, having spoken to some prominent members of the Haskell community I
think it is unreasonable to call them a community because they are all
pulling in completely different directions. Simon Peyton Jones wants a
vehicle for state-of-the-art research going forwards with no baggage. Simon
Marlow wants to avoid conflict and get on with solving academic problems.
Don Stewart wants Haskell to become mainstream without discussion. Benjamin
Russell wants people to make up their own minds based upon open discussion
and, if nothing else, learn something in the process. Suffice to say, there
are incompatible goals.

That makes it very difficult to discuss Haskell because you have to say "you
cannot expect a language that does not provide a decent hash table
implementation to be taken seriously as a general-purpose language in the
real world" to Don Stewart but clarify that "there's nothing wrong with
neglecting features that are not relevant to your research" to the Simons.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Matti Nykänen
2009-04-20 06:05:56 UTC
Permalink
Post by Jon Harrop
No, tries are a very specialized data structure for keys that are sequences
and you want very specialized operations over them like searching by
Levenshtein distance. Tries offer competitive performance only when the
keys are sufficiently long sequences.
The "very specialized" and "only" in the above are a bit much. After
all, tries do lurk behind e.g. the directory of an extendible hashing
table - they are just masked as bit arithmetic on array indices.
Jon Harrop
2009-04-20 18:36:37 UTC
Permalink
Post by Matti Nykänen
Post by Jon Harrop
No, tries are a very specialized data structure for keys that are
sequences and you want very specialized operations over them like
searching by Levenshtein distance. Tries offer competitive performance
only when the keys are sufficiently long sequences.
The "very specialized" and "only" in the above are a bit much. After
all, tries do lurk behind e.g. the directory of an extendible hashing
table - they are just masked as bit arithmetic on array indices.
Sorry, I did not qualify that sufficiently. I was referring only to the
tries proposed as an alternative to hash tables in Haskell, i.e. purely
functional (immutable) and avoiding arrays of boxed values.

I agree that imperative tries based upon arrays or hash tables are much more
widely used and useful.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Adrian Hey
2009-04-20 12:06:47 UTC
Permalink
Hello Jon,
Post by Jon Harrop
Post by Adrian Hey
Also, I think your statement about hash tables being the *only* way to
get a performant dictionary is wrong. Tries should offer comparable
(often better) performance and have some nice properties that hash
tables don't give you.
No, tries are a very specialized data structure for keys that are sequences
and you want very specialized operations over them like searching by
Levenshtein distance. Tries offer competitive performance only when the
keys are sufficiently long sequences.
But unless you have some guarantee that keys will be short, this is the
case you should plan for IMO. Like Stephen, I'm more concerned about
optimising worst-case performance (not typical, if you're lucky, with a
following wind performance :-). e.g. assume you're going to be the victim
of some devious algorithmic denial of service attack.
Post by Jon Harrop
In general, a decent hash table is
10x faster even when keys are short sequences, e.g. English words. Tries
are slow because they incur many costly indirections whereas (decent) hash
tables incur only one.
Only if there are no collisions (and there are no indirections in the
key/string representation itself).

If you google around a bit for "Generalised Tries" you can see that
Tries can be generalised for arbitrary tree-like keys. Basically any
key that can be constructed from algebraic data types. I'm personally
sceptical about the performance of the (IMO naive) implementations
you'll find in most papers on the subject, but the idea is basically
a good one I think.
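
For readers who haven't met them, a minimal sketch of the idea in the Hinze
style, over a tiny hypothetical key type (the type and all names here are
mine, purely for illustration): the trie for a sum type is a product of
tries, and the trie for a product type is a nested trie.

import qualified Data.IntMap as IM

data Key = Leaf Int | Node Key Key   -- hypothetical algebraic key type

data KeyMap v = KeyMap
  { leafCase :: IM.IntMap v               -- bindings whose key is  Leaf n
  , nodeCase :: Maybe (KeyMap (KeyMap v)) -- bindings whose key is  Node l r
  }

-- Lookup recurses on the shape of the key; there is no hashing and there
-- are no collisions, but one level of map per constructor in the key.
lookupKey :: Key -> KeyMap v -> Maybe v
lookupKey (Leaf n)   t = IM.lookup n (leafCase t)
lookupKey (Node l r) t = nodeCase t >>= lookupKey l >>= lookupKey r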

Your point about indirection costs is valid, but it's an unfortunate
fact of life with representations of non-trivial keys in any language
(keys may themselves be, or may contain, AVL trees, for example). Even
Haskell's native string representation is a linked list of chars (for
better or worse).

Unless I'm missing something, a hash table will require at least 3
traversals of complex indirected key structures for a successful lookup
(assuming no collisions): one to compute the hash and two for the final
equality verification (comparing the probe key against the stored key
traverses both).

A Trie implementation will essentially require the 2 traversals for
equality verification + cost of traversal of the trie structure
itself. But you know you won't get any collisions.

A perfectly valid alternative approach would be to serialise keys
into some kind of compact "bit block" and implement a map on
"bit block" keys (by whatever means). But as these are not the
native representations, you'd have to factor the serialisation cost
into your performance measurements.
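
One reading of that, as a minimal sketch (using the binary package is my
choice here, just to show the shape of the approach): encode the key once up
front and key the map on the resulting byte string.

import Data.Binary (Binary, encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map as M

-- The structured key is traversed once by the serialiser; after that, all
-- comparisons inside the map are plain byte-string comparisons.
insertSer :: Binary k => k -> v -> M.Map BL.ByteString v -> M.Map BL.ByteString v
insertSer k = M.insert (encode k)

lookupSer :: Binary k => k -> M.Map BL.ByteString v -> Maybe v
lookupSer k = M.lookup (encode k)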

Regards
--
Adrian Hey
Jon Harrop
2009-04-20 19:06:48 UTC
Permalink
Post by Adrian Hey
Post by Jon Harrop
No, tries are a very specialized data structure for keys that are
sequences and you want very specialized operations over them like
searching by Levenshtein distance. Tries offer competitive performance
only when the keys are sufficiently long sequences.
But unless you have some guarantee that keys will be short, this is the
case you should plan for IMO.
Ok. I believe keys are usually short, e.g. ints and floats.
Post by Adrian Hey
Like Stephen, I'm more concerned about
optimising worst-case performance (not typical, if you're lucky, with a
following wind performance :-). e.g. assume you're going to be the victim
of some devious algorithmic denial of service attack.
Fair enough. I am interested in solving problems efficiently but without
trie advocates trying to hack my system. :-)
Post by Adrian Hey
Post by Jon Harrop
In general, a decent hash table is
10x faster even when keys are short sequences, e.g. English words. Tries
are slow because they incur many costly indirections whereas (decent)
hash tables incur only one.
Only if there are no collisions (and there are no indirections in the
key/string representation itself).
If you google around a bit for "Generalised Tries" you can see that
Tries can be generalised for arbitrary tree like keys. Basically any
key that can be constructed from algebraic data types. I'm personally
sceptical about the performance of the (IMO naive) implementations
you'll find in most papers on the subject, but the idea is basically
a good one I think.
Right. I tried and failed to implement a pattern matcher for Mathematica
based upon generalized tries once.
Post by Adrian Hey
Your point about indirection costs is valid, but it's an unfortunate
fact of life with representations of non-trivial keys used by any
language (e.g. keys themselves may be or may contain AVL trees for
example). Even Haskells native string representation is a linked list
of chars (for better or worse).
True. I was referring to simple keys, which I believe are most common. In
fact, OCaml's Hashtbl.t uses only structural hashing, so it will screw up if
you use keys that are trees, and .NET's hash table implementation uses
physical hashing of arrays (!). There are workarounds, of course.
Post by Adrian Hey
Unless I'm missing something, a hash table will require at least 3
traversals of complex indirected key structures for a successful
lookup (assuming no collisions). 1 to compute the hash and 2 for the
final equality verification.
For Hashtbl.replace in OCaml, yes, but you can use Hashtbl.add if you know
there is no existing binding and/or you want to shadow bindings. That just
bungs the new binding in without performing any equality tests.
Post by Adrian Hey
A Trie implementation will essentially require the 2 traversals for
equality verification + cost of traversal of the trie structure
itself. But you know you won't get any collisions.
Traversing the trie is the bit that kills performance.
Post by Adrian Hey
A perfectly valid alternative approach would be to serialise keys
into some kind of compact "bit block" and implement a map on
"bit block" keys (by whatever means). But as these are not the
native representations you'd have to factor in the serialisation
cost into your performance measurements.
But that's just it: that bit twiddling with keys is instantaneous compared
to the hugely expensive indirections to main memory (a big data structure).
Today, hundreds of bit twiddles that save a single random indirection will
improve performance.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Paul Rubin
2009-04-21 02:25:33 UTC
Permalink
Post by Jon Harrop
Post by Adrian Hey
e.g. Assume you're going to be victim
of some devious algorithmic denial of service attack.
Fair enough. I am interested in solving problems efficiently but without
trie advocates trying to hack my system. :-)
But the application under discussion is an internet firewall, which by
definition aims to be robust against the efforts of diabolically
fiendish attackers trying to exploit any properties of it that they
can to make it fail. Using hash tables for that kind of maliciously
concocted input set is just asking for trouble.
Jon Harrop
2009-04-21 02:42:33 UTC
Permalink
Post by Paul Rubin
Post by Jon Harrop
Post by Adrian Hey
e.g. Assume you're going to be victim
of some devious algorithmic denial of service attack.
Fair enough. I am interested in solving problems efficiently but without
trie advocates trying to hack my system. :-)
But the application under discussion is an internet firewall, which by
definition aims to be robust against the efforts of diabolically
fiendish attackers trying to exploit any properties of it that they
can to make it fail. Using hash tables for that kind of maliciously
concocted input set is just asking for trouble.
Then you should be able to point out some vulnerabilities in the approach I
proposed. :-)
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Paul Rubin
2009-04-21 07:17:58 UTC
Permalink
Post by Jon Harrop
But the application under discussion is an internet firewall, ...
Then you should be able to point out some vulnerabilities in the approach I
proposed. :-)
Eh? You are doing a bunch of inserts into a hash table with known (or
guessable) parameters and a known hash algorithm. A bit of careful
engineering should be able to produce a pessimal key sequence.
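
The shape of such an attack, as a toy sketch under assumptions I'm supplying
(a chained table whose bucket index is the key modulo the table size, with
an identity-like hash on ints):

-- Every key below lands in bucket 0, so n inserts degenerate from O(n)
-- expected work to O(n^2) key comparisons down a single chain,
-- e.g. pessimalKeys 1024 100000.
pessimalKeys :: Int -> Int -> [Int]
pessimalKeys buckets n = [ i * buckets | i <- [1 .. n] ]
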
Jon Harrop
2009-04-21 13:38:49 UTC
Permalink
Post by Paul Rubin
Post by Jon Harrop
But the application under discussion is an internet firewall, ...
Then you should be able to point out some vulnerabilities in the approach
I proposed. :-)
Eh? You are doing a bunch of inserts into a hash table with known (or
guessable) parameters and a known hash algorithm. A bit of careful
engineering should be able to produce a pessimal key sequence.
Just quantify the overheads of that pessimal key sequence: a couple of words
in memory and a few cycles in time. Completely insignificant.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Paul Rubin
2009-04-21 14:15:49 UTC
Permalink
Post by Jon Harrop
Just quantify the overheads of that pessimal key sequence: a couple of words
in memory and a few cycles in time. Completely insignificant.
I don't understand OCaml well enough to verify that easily. How about
replacing the hash table with an array and linear search and timing it?
Also, what's a "couple of words"? What about the IPv6 case?
Larry Coleman
2009-04-20 12:56:39 UTC
Permalink
Post by Jon Harrop
Post by Adrian Hey
Whilst I would agree (after 10 years or so of using the language) that
Haskell has many serious problems, I think you're missing the point
somewhat in your criticism. Sucky hash table libs, or map libs, or even
sucky performance generally are the least of these problems IMO (and
for the most part easily fixable).
I think the hash table implementation is known to suck, but nobody cares
because nobody wants to use it anyway.
That is a circular argument. Nobody uses them in Haskell because they suck
(only) in Haskell.
O.K., which is better, circular argument, or unsupported assertion?
Before you answer, please note that the argument, which isn't really
an argument, is only made circular if you accept the unsupported
assertion.
Jon Harrop
2009-04-20 18:49:06 UTC
Permalink
Post by Larry Coleman
Post by Jon Harrop
Post by Adrian Hey
Whilst I would agree (after 10 years or so of using the language) that
Haskell has many serious problems, I think you're missing the point
somewhat in your criticism. Sucky hash table libs, or map libs, or even
sucky performance generally are the least of these problems IMO (and
for the most part easily fixable).
I think the hash table implementation is known to suck, but nobody
cares because nobody wants to use it anyway.
That is a circular argument. Nobody uses them in Haskell because they
suck (only) in Haskell.
O.K., which is better, circular argument, or unsupported assertion?
Before you answer, please note that the argument, which isn't really
an argument, is only made circular if you accept the unsupported
assertion.
What is the unsupported assertion?
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Larry Coleman
2009-04-20 20:08:28 UTC
Permalink
Post by Jon Harrop
Post by Larry Coleman
Post by Jon Harrop
Post by Adrian Hey
Whilst I would agree (after 10 years or so of using the language) that
Haskell has many serious problems, I think you're missing the point
somewhat in your criticism. Sucky hash table libs, or map libs, or even
sucky performance generally are the least of these problems IMO (and
for the most part easily fixable).
I think the hash table implementation is known to suck, but nobody
cares because nobody wants to use it anyway.
That is a circular argument. Nobody uses them in Haskell because they
suck (only) in Haskell.
O.K., which is better, circular argument, or unsupported assertion?
Before you answer, please note that the argument, which isn't really
an argument, is only made circular if you accept the unsupported
assertion.
What is the unsupported assertion?
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
That the main reason for hash tables not being used is performance.
I'm not a Haskell expert, but I've managed to get things done without
using them. You may get more information on this point on fa.haskell.

FWIW, it never occurred to me to even try using a hash table while
writing my SQL parser or the utility programs that use it. I did need
to compare two sets of table names to get a set difference, but it was
simpler just to use Sets.
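
For reference, that kind of set difference is a one-liner with Data.Set (the
names here are hypothetical, not Larry's code):

import qualified Data.Set as S

-- e.g. table names referenced by the queries but absent from the schema
missingTables :: [String] -> [String] -> [String]
missingTables expected actual =
  S.toList (S.fromList expected `S.difference` S.fromList actual)
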
Jon Harrop
2009-04-20 20:18:42 UTC
Permalink
Post by Larry Coleman
Post by Jon Harrop
What is the unsupported assertion?
That the main reason for hash tables not being used is performance.
I see. Fair enough.
Post by Larry Coleman
I'm not a Haskell expert, but I've managed to get things done without
using them. You may get more information on this point on fa.haskell.
FWIW, it never occurred to me to even try using a hash table while
writing my SQL parser or the utility programs that use it. I did need
to compare two sets of table names to get a set difference, but it was
simpler just to use Sets.
Sure. I would have used sets for that as well. Actually, I very rarely use
hash sets because mutation makes set theoretic operations incredibly slow
(e.g. in C++).
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Larry Coleman
2009-04-20 13:19:33 UTC
Permalink
Post by Jon Harrop
Post by Adrian Hey
But I think Haskellers can take heart from the knowledge that you can
always be certain you've achieved something significant when people take
the time to bash you. It's better than the fate that awaits most
experimental or DIY language implementors (being completely ignored).
True.
However, having spoken to some prominent members of the Haskell community I
think it is unreasonable to call them a community because they are all
pulling in completely different directions. Simon Peyton Jones wants a
vehicle for state-of-the-art research going forwards with no baggage. Simon
Marlow wants to avoid conflict and get on with solving academic problems.
Don Stewart wants Haskell to become mainstream without discussion. Benjamin
Russell wants people to make up their own minds based upon open discussion
and, if nothing else, learn something in the process. Suffice to say, there
are incompatible goals.
That makes it very difficult to discuss Haskell because you have to say "you
cannot expect a language that does not provide a decent hash table
implementation to be taken seriously as a general-purpose language in the
real world" to Don Stewart but clarify that "there's nothing wrong with
neglecting features that are not relevant to your research" to the Simons.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Now these are good points.

However, I don't think any of this really makes a difference. Haskell
will never be a mainstream language, not for any reason you mention,
but because its effective use requires a change in thinking patterns.
Most mainstream developers just want to get their work done so they
can get paid and go home. Learning anything new at all is undesirable,
and thinking differently is definitely out. What's worse, the same
applies to any functional programming language as compared to the
mainstream imperative universe. So why are we even talking about this?
There's a nice static vs. dynamic flame war at c.l.l. that sounds more
interesting.
Jon Harrop
2009-04-20 18:48:26 UTC
Permalink
Post by Larry Coleman
However, I don't think any of this really makes a difference. Haskell
will never be a mainstream language, not for any reason you mention,
but because its effective use requires a change in thinking patterns.
Most mainstream developers just want to get their work done so they
can get paid and go home. Learning anything new at all is undesirable,
and thinking differently is definitely out. What's worse, the same
applies to any functional programming language as compared to the
mainstream imperative universe. So why are we even talking about this?
There's a nice static vs. dynamic flame war at c.l.l. that sounds more
interesting.
That same argument was used against OOP 30 years ago and GCs 20 years ago. I
see no reason why functional programming will not follow the same trend.
Indeed, C# 3 and .NET 3.5 have already made it happen to a large extent. F#
is next...
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Larry Coleman
2009-04-20 20:10:23 UTC
Permalink
Post by Jon Harrop
Post by Larry Coleman
However, I don't think any of this really makes a difference. Haskell
will never be a mainstream language, not for any reason you mention,
but because its effective use requires a change in thinking patterns.
Most mainstream developers just want to get their work done so they
can get paid and go home. Learning anything new at all is undesirable,
and thinking differently is definitely out. What's worse, the same
applies to any functional programming language as compared to the
mainstream imperative universe. So why are we even talking about this?
There's a nice static vs. dynamic flame war at c.l.l. that sounds more
interesting.
That same argument was used against OOP 30 years ago and GCs 20 years ago. I
see no reason why functional programming will not follow the same trend.
Indeed, C# 3 and .NET 3.5 have already made it happen to a large extent. F#
is next...
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
I sincerely hope you're right on this. I tend to doubt it, because
both the changes you mention occurred before the commodity programmers
showed up.
Jon Harrop
2009-04-20 20:44:11 UTC
Permalink
Post by Larry Coleman
I sincerely hope you're right on this. I tend to doubt it, because
both the changes you mention occurred before the commodity programmers
showed up.
Well, I think it already happened. The .NET world is already using
monadic-style LINQ and functional-style WPF in the real world. Uptake is
slow:

http://www.google.com/trends?q=wpf

but I've no doubt it will supersede the old WinForms style of programming
before long.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
Ertugrul Söylemez
2009-04-20 22:31:34 UTC
Permalink
Post by Larry Coleman
Post by Jon Harrop
Post by Larry Coleman
However, I don't think any of this really makes a
difference. Haskell will never be a mainstream language, not for
any reason you mention, but because its effective use requires a
change in thinking patterns. Most mainstream developers just want
to get their work done so they can get paid and go home. Learning
anything new at all is undesirable, and thinking differently is
definitely out. What's worse, the same applies to any functional
programming language as compared to the mainstream imperative
universe. So why are we even talking about this? There's a nice
static vs. dynamic flame war at c.l.l. that sounds more
interesting.
That same argument was used against OOP 30 years ago and GCs 20
years ago. I see no reason why functional programming will not
follow the same trend. Indeed, C# 3 and .NET 3.5 have already made
it happen to a large extent. F# is next...
I sincerely hope you're right on this. I tend to doubt it, because
both the changes you mention occurred before the commodity programmers
showed up.
Consider that even Visual Basic supports closures today, even though
they are a PITA to use. However, VB itself is a PITA anyway.


Greets,
Ertugrul.
--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/
Adrian Hey
2009-05-01 11:06:35 UTC
Permalink
Hello Folks,

I see this thread has made it to reddit thanks to JH and also to
someone's blog:

http://fpmatters.blogspot.com/

Who is fpmatters anyway? Is this JH too I wonder? :-)
Post by Adrian Hey
Haskell is not a failed experiment! Even if all that's been achieved
is a demonstration of how not to design a "purely functional"
programming language, it's still a successful experiment :-)
and in the above mentioned blog we have..

"Thank you, Adrian, for your candor. Characterizing Haskell as a
sophisticated academic calculator and an experiment, whose main value is
in teaching us how not to design languages, certainly wouldn't sell many
books."

My remark was intended to be humorous and was qualified with *even if*.
I just want to state for the record that it's not my opinion that
Haskells "main value is in teaching us how not to design languages".
There's a lot any language designers could learn from Haskell, both
good and bad.

Haskell is a fine language for typical PC or workstation environments
with shed loads of ram and used for applications that *do* terminate.
Especially if your primary concern is *correctness*, rather than
performance (or even robustness): e.g. a typical file munger, including
compilers for even better languages :-)

But anyone looking for utmost reliability from programs that run
"forever" should be wary of it I think. It's quite difficult to be
confident that a Haskell program can be contained in bounded memory
indefinitely.

My 2p..

Regards
--
Adrian Hey
Larry Coleman
2009-05-01 12:57:53 UTC
Permalink
Post by Adrian Hey
Hello Folks,
I see this thread has made it to reddit thanks to JH and also to
http://fpmatters.blogspot.com/
Who is fpmatters anyway? Is this JH too I wonder? :-)
I don't think fpmatters is JH because fpmatters actually has some nice
things to say about Haskell in other posts. ;-)
Adrian Hey
2009-05-01 14:51:44 UTC
Permalink
Hello Larry,
Post by Larry Coleman
I don't think fpmatters is JH because fpmatters actually has some nice
things to say about Haskell in other posts. ;-)
I think you're right. I see I've also been flatteringly described as being
a "prominent member of the FPL community". Dunno where he got that idea
from :-)

Regards
--
Adrian Hey
Jon Harrop
2009-05-02 00:48:49 UTC
Permalink
Post by Adrian Hey
Hello Larry,
Post by Larry Coleman
I don't think fpmatters is JH because fpmatters actually has some nice
things to say about Haskell in other posts. ;-)
I think you're right. I see I've also been flatteringly described as being
a "prominent member of the FPL community". Dunno where he got that idea
from :-)
That blog certainly isn't me. For example, they list what they believe are
important practical aspects for future functional languages and I find
myself strongly disagreeing:

http://fpmatters.blogspot.com/2009/04/what-practical-modern-fpl-ought-to-be.html

"Syntax: Haskell-like, with significant white space. This is where Haskell
and Clean get it right."

Indentation-sensitive syntax is for visually impaired conference delegates.
Programmers want code pasted from a website to Just Work. In OCaml, you
autoindent with ALT-Q in Emacs. In Haskell (and F#), you are completely
screwed and have to reindent the entire code by hand and hope that you get
it right or you'll change the semantics of the program. If you want
brevity, use inference more as OCaml does compared to Haskell and F#.

"Macros: the need for macros is diminished in languages that can be lazy.
I don't believe a language should be designed around macros, like Lisp is.
Programming in Haskell, I rarely feel the need for macros, if at all. As a
classic example, || can be defined as a function in it."
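
The (||) example the blog alludes to, spelled out as an ordinary function
(the name orElse is mine): it short-circuits only because Haskell is
non-strict, and the real Prelude (||) is defined the same way.

-- A short-circuiting "or" as a plain function: the second argument is never
-- evaluated when the first is True, so e.g. orElse True undefined is True.
orElse :: Bool -> Bool -> Bool
orElse True  _ = True
orElse False b = b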

I don't want to have to use laziness as an inefficient workaround for a lack
of decent tools. Camlp4 kicks ass. You'll never get decent metatools from
Microsoft, of course, because they undermine lock-in.

"Evaluation: Functions should be strict by default, optionally lazy.
Default data structures: lazy by default. This way, you get the best of
both worlds. This is where Clojure gets it right, and almost no one else."

I agree with strict functions by default but I don't care for lazy data
structures except for lazy lists.

"Runtime: JVM. JVM is very fast these days, and it will only get faster.
Some people will bring up the lack of full tail-call optimization in JVM,
but it's coming soon. JVM can do "unthinkable" feats like inlining virtual
function calls across separate compilation units and even code generated at
runtime. Not having to develop your own runtime is big. Having easy access
to huge libraries is important too. And the fundamental difference between
calling C from Haskell and, say, Java from Clojure is that the latter is
memory-safe, even if you screw up."

If you use parametric polymorphism or multiple return values, the JVM is
slower than my grandma. Moreover, it does not even have tail calls which
are an absolutely essential feature of any FPL. According to the guys at
Wolfram Research, they would be building upon the JVM if its FFI weren't so
grindingly slow.

"MLs: strict and statically typed; but everything else is not so good"

I unclog my nose in your general direction.

OCaml is a great language with great libraries and tools and is still
rapidly evolving, e.g. the batteries included and parallel GC projects are
nearing stable release.
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
George Neuner
2009-05-02 03:15:39 UTC
Permalink
Post by Jon Harrop
That blog certainly isn't me. For example, they list what they believe are
important practical aspects for future functional languages and I find
http://fpmatters.blogspot.com/2009/04/what-practical-modern-fpl-ought-to-be.html
"Macros: the need for macros is diminished in languages that can be lazy.
I don't believe a language should be designed around macros, like Lisp is.
Programming in Haskell, I rarely feel the need for macros, if at all. As a
classic example, || can be defined as a function in it."
Just for clarification, you can also define "or" and "if" as functions
in Lisp by wrapping the arguments in anonymous lambdas. The code just
looks a little messy which is why it is packaged in a macro.

Lisp macros can be used for defining control flow because they don't
evaluate their arguments. However, most programmers don't use them
that way. Lisp macros are used mostly for templated code generation
where the programmer has identified a repetitive pattern of coding.

George
namekuseijin
2009-05-03 07:06:02 UTC
Permalink
Post by Adrian Hey
I see this thread has made it to reddit thanks to JH and also to
http://fpmatters.blogspot.com/
Well, it has some amusing polls. For instance, Favorite Programming
Language goes to Haskell, Static Typing seems to be "the only way to
fly"... but Best Syntax is "S-Expressions"?! If there are so many
Lisp fans there, why are the other categories not reflecting it? :P
Marco Antoniotti
2009-05-03 11:14:10 UTC
Permalink
Post by Adrian Hey
I see this thread has made it to reddit thanks to JH and also to
http://fpmatters.blogspot.com/
Well, it has some amusing polls.  For instance, Favorite Programming
Language goes to Haskell, Static Typing seems to be "the only way to
fly"... but Best Syntax is "S-Expressions"?!  If there are so many
Lisp fans there, why are the other categories not reflecting it? :P
I have just read the above, but I am not surprised. I find myself
fitting these comments (am I going mainstream?). Haskell is good (I
find it less constraining than OCaml and ML things). Static typing is
good (when it does not get in the way of programming). S-expressions
are good, but they cannot beat a full blown XML-based programming
language. :)

Cheers
--
Marco
ELS 2009 www.european-lisp-symposium.org
Slobodan Blazeski
2009-05-03 12:19:45 UTC
Permalink
Post by Adrian Hey
I see this thread has made it to reddit thanks to JH and also to
http://fpmatters.blogspot.com/
Well, it has some amusing polls.  For instance, Favorite Programming
Language goes to Haskell, Static Typing seems to be "the only way to
fly"... but Best Syntax is "S-Expressions"?!  If there are so many
Lisp fans there, why are the other categories not reflecting it? :P
1. Favorite FPL:
CL programmers don't call CL functional, but a multi-paradigm
language. I've skipped voting.
2. Monads are
Good thing under certain very limited circumstances. There is no such
option so I've skipped voting.
3. Static typing is
Sometimes good, sometimes bad; it depends on the problem you're solving.
Since there is no middle answer I've skipped voting.
4. Best syntax is
S-expressions definitely. The only one I voted for.

cheers
Bobi
http://slobodanblazeski.blogspot.com/
namekuseijin
2009-05-03 17:36:24 UTC
Permalink
Post by Slobodan Blazeski
Post by namekuseijin
Post by Adrian Hey
I see this thread has made it to reddit thanks to JH and also to
http://fpmatters.blogspot.com/
Well, it has some amusing polls. For instance, Favorite Programming
Language goes to Haskell, Static Typing seems to be "the only way to
fly"... but Best Syntax is "S-Expressions"?! If there are so many
Lisp fans there, why are the other categories not reflecting it? :P
CL programmers don't call CL functional, but a multi-paradigm
language. I've skipped voting.
2. Monads are
Good thing under certain very limited circumstances. There is no such
option so I've skipped voting.
3. Static typing is
Sometimes good, sometimes bad; it depends on the problem you're solving.
Since there is no middle answer I've skipped voting
4. Best syntax is
S-expressions definitely. The only one I voted for.
Now I understand why it's a bad idea to not vote in presidential
elections. ;)
Slobodan Blazeski
2009-05-04 11:59:36 UTC
Permalink
Post by namekuseijin
Post by Adrian Hey
I see this thread has made it to reddit thanks to JH and also to
http://fpmatters.blogspot.com/
Well, it has some amusing polls.  For instance, Favorite Programming
Language goes to Haskell, Static Typing seems to be "the only way to
fly"... but Best Syntax is "S-Expressions"?!  If there are so many
Lisp fans there, why are the other categories not reflecting it? :P
CL programmers don't call CL functional, but a multi-paradigm
language. I've skipped voting.
2. Monads are
Good thing under certain very limited circumstances. There is no such
option so I've skipped voting.
3. Static typing is
Sometimes good, sometimes bad; it depends on the problem you're solving.
Since there is no middle answer I've skipped voting
4. Best syntax is
S-expressions definitely. The only one I voted for.
Now I understand why it's a bad idea to not vote in presidential
elections. ;)
How would you vote if you were a CL programmer?

bobi
Marek Kubica
2009-05-04 22:02:22 UTC
Permalink
On Sun, 3 May 2009 05:19:45 -0700 (PDT)
Post by Slobodan Blazeski
3. Static typing is
Sometimes good, sometimes bad; it depends on the problem you're solving.
Since there is no middle answer I've skipped voting
It also depends on the type system. There are constraining type systems
like Java and enabling type systems like the one in Haskell. I wasn't
sure what to vote there either.

regards,
Marek
Scott Burson
2009-05-03 17:20:41 UTC
Permalink
Post by namekuseijin
Favorite Programming
Language goes to Haskell, Static Typing seems to be "the only way to
fly"... but Best Syntax is "S-Expressions"?!  If there are so many
Lisp fans there, why are the other categories not reflecting it? :P
Maybe Liskell is getting popular???

-- Scott
Boris Borcic
2009-05-06 10:32:58 UTC
Permalink
Post by Jon Harrop
http://flyingfrogblog.blogspot.com/2009/04/f-vs-ocaml-vs-haskell-hash-table.html
Haskell's hash table implementation is even slower than Python's and 32x
slower than .NET's.
Apparently the idiomatic [...]
Speaking of idioms, the psyco CPython accelerator used with the Python example
in fact does not understand/optimize generators, and a quick test with py2.5 on
Windows shows that removing it from your one-liner scales running time *down* to 75%.

Rewriting the loop (without a generator or acceleration) puts that factor at 43%,
and using psyco on the rewritten loop brings it to 29%.

Which makes Python nearly twice as fast as *OCaml* on this test, and only a bit
more than two times slower than F#.

The code, FYI
>>> from timeit import Timer
>>> t = Timer('f()', '''
import psyco
psyco.full()
def f():
    d = {}
    for i in xrange(10000000):
        d[i] = i
    print d[100]''')
>>> t.timeit(1)
Cheers,

BB
Boris Borcic
2009-05-06 14:27:32 UTC
Permalink
Post by Boris Borcic
Which makes Python nearly twice as fast as *OCaml* on this test, and
only a bit more than two times slower than F#.
An interesting bit is given by the link at

http://www.codeplex.com/IronPython/Wiki/View.aspx?title=IronPython%20Performance&referringTitle=Home

leading to various performance comparisons between the C- and the .NET
implementations of Python.

The interesting bit is that CPython is significantly faster than IronPython (i.e.
.NET) as far as dicts (hashes) with integer or string keys are concerned.

Cheers,

BB
namekuseijin
2009-05-06 17:45:44 UTC
Permalink
Now that is amusing. :)
Post by Boris Borcic
Post by Jon Harrop
http://flyingfrogblog.blogspot.com/2009/04/f-vs-ocaml-vs-haskell-hash-table.html
Haskell's hash table implementation is even slower than Python's and 32x
slower than .NET's.
Apparently the idiomatic [...]
Speaking of idioms, the psyco CPython accelerator used with the python
example in fact does not understand/optimize generators, and a quick
test with py2.5 on windows shows that removing it from your one-liner
scales running time *down* to 75%
Rewriting the loop (without generator nor acceleration) puts that factor
at 43%, and using psyco on the rewritten loop brings it to 29%.
Which makes Python nearly twice as fast as *OCaml* on this test, and
only a bit more than two times slower than F#.
The code, FYI
>>> from timeit import Timer
>>> t = Timer('f()', '''
import psyco
psyco.full()
def f():
    d = {}
    for i in xrange(10000000):
        d[i] = i
    print d[100]''')
>>> t.timeit(1)
Cheers,
BB
--
a game sig: http://tinyurl.com/d3rxz9