Thursday, October 30, 2008

Vote for Obama, No on California Prop 8

I'm not a US citizen, so I can't vote in US elections (though I pay taxes like the best of them!), but my wife is. We have two signs in our front yard this week. One simply says "Obama", the other "No on Prop 8". I'd like to put up the equivalent of those signs in my blog.

If you're eligible to vote in the US, do me and the rest of the world a favor and vote for Obama. If you're voting in California, please help to reject Proposition 8, which is a blatant curtailing of freedom, equality and happiness for gay Californians. Defeating Prop 8 in California will pave the way for equal rights in other states and eventually all of the US. Don't make me ashamed of living here!

PS. If you think you have no time to vote, think again.

Wednesday, October 29, 2008

What makes me feel good

Whenever I feel down, I look at the TIOBE programming community index, and it makes me feel better. :-)

Monday, October 27, 2008

Questions Answered

I have now answered the top 20 questions in my section of "Ask a Google Engineer". Many of the remaining ones sound inappropriate or unanswerable, so I don't expect I'll be answering them, unless the popular vote really brings some of them to the top.

Sunday, October 26, 2008

Why explicit self has to stay

Bruce Eckel has blogged about a proposal to remove 'self' from the formal parameter list of methods. I'm going to explain why this proposal can't fly.

Bruce's Proposal

Bruce understands that we still need a way to distinguish references to instance variables from references to other variables, so he proposes to make 'self' a keyword instead. Consider a typical class with one method, for example:
class C:
    def meth(self, arg):
        self.val = arg
        return self.val
Under Bruce's proposal this would become:
class C:
    def meth(arg):  # Look ma, no self!
        self.val = arg
        return self.val
That's a saving of 6 characters per method. However, I don't believe Bruce proposes this so that he has to type less. I think he's more concerned about the time wasted by programmers, presumably coming from other languages where the 'self' parameter doesn't need to be specified, who occasionally forget it (even though they know better -- habit is a powerful force). It's true that omitting 'self' from the parameter list tends to lead to more obscure error messages than forgetting to type 'self.' in front of an instance variable or method reference. Perhaps even worse (as Bruce mentions) is the error message you get when the method is declared correctly but the call has the wrong number of arguments, as in this example given by Bruce:
Traceback (most recent call last):
  File "classes.py", line 9, in <module>
    obj.m2(1)
TypeError: m2() takes exactly 3 arguments (2 given)
I agree that this is confusing, but I would rather fix this error message without changing the language.
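For concreteness, here is a minimal sketch that reproduces this kind of error. The class T and method m2 are hypothetical stand-ins mirroring Bruce's example, and the exact wording of the message varies between Python versions:

```python
# A method that takes two arguments besides 'self':
class T:
    def m2(self, a, b):
        return a + b

obj = T()
try:
    obj.m2(1)  # one argument too few -- but the implicit 'self' is counted too
except TypeError as e:
    message = str(e)  # e.g. "m2() takes exactly 3 arguments (2 given)"
```

The confusion comes from the interpreter counting the implicit 'self' among the arguments, so the numbers in the message don't match what the caller wrote.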

Why Bruce's Proposal Can't Work

Let me first mention a few typical arguments that are brought in against Bruce's proposal.

There's a pretty good argument to make that requiring explicit 'self' in the parameter list reinforces the theoretical equivalency between these two ways of calling a method, given that 'foo' is an instance of 'C':
foo.meth(arg) == C.meth(foo, arg)
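To see this equivalence in action, here's a small sketch using the class C from above:

```python
class C:
    def meth(self, arg):
        self.val = arg
        return self.val

foo = C()
# Calling through the instance and calling through the class with an
# explicit first argument do exactly the same thing:
assert foo.meth(42) == C.meth(foo, 42)
assert foo.val == 42
```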


Another argument for keeping explicit 'self' in the parameter list is the ability to dynamically modify a class by poking a function into it, which creates a corresponding method. For example, we could create a class that is completely equivalent to 'C' above as follows:
# Define an empty class:
class C:
    pass

# Define a global function:
def meth(myself, arg):
    myself.val = arg
    return myself.val

# Poke the method into the class:
C.meth = meth
Note that I renamed the 'self' parameter to 'myself' to emphasize that (syntactically) we're not defining a method here. Now instances of C have a method with one argument named 'meth' that works exactly as before. It even works for instances of C that were created before the method was poked into the class.
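To spell out that last point, a quick demonstration:

```python
class C:
    pass

def meth(myself, arg):
    myself.val = arg
    return myself.val

obj = C()      # created *before* the method exists
C.meth = meth  # poke the function into the class
# The pre-existing instance picks up the new method anyway:
assert obj.meth(42) == 42
assert obj.val == 42
```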

I suppose that Bruce doesn't particularly care about the former equivalency. I agree that it's more of theoretical importance. The only exception I can think of is the old idiom for calling a super method. However, this idiom is pretty error-prone (exactly due to the requirement to explicitly pass 'self'), and that's why in Python 3000 I'm recommending the use of 'super()' in all cases.

Bruce can probably think of a way to make the second equivalency work -- there are some use cases where this is really important. I don't know how much time Bruce spent thinking about how to implement his proposal, but I suppose he is thinking along the lines of automatically adding an extra formal parameter named 'self' to all methods defined directly inside a class (I have to add 'directly' so that functions nested inside methods are exempted from this automatism). This way the first equivalency can still be made to hold.

However, there's one situation that I don't think Bruce can fix without adding some kind of ESP to the compiler: decorators. This I believe is the ultimate downfall of Bruce's proposal.

When a method definition is decorated, we don't know whether to automatically give it a 'self' parameter or not: the decorator could turn the function into a static method (which has no 'self'), or a class method (which has a funny kind of self that refers to a class instead of an instance), or it could do something completely different (it's trivial to write a decorator that implements '@classmethod' or '@staticmethod' in pure Python). There's no way without knowing what the decorator does whether to endow the method being defined with an implicit 'self' argument or not.
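To illustrate that last parenthetical, here is a sketch of a pure-Python stand-in for '@staticmethod', built on the descriptor protocol (the name my_staticmethod is made up for this example). The point is that the compiler cannot tell from the '@' line alone that the decorated function must not receive a 'self':

```python
class my_staticmethod:
    def __init__(self, func):
        self.func = func
    def __get__(self, obj, objtype=None):
        return self.func  # return the plain function: no 'self' gets bound in

class C:
    @my_staticmethod
    def greet(name):
        return 'hello, ' + name

# Works the same through the class and through an instance:
assert C.greet('world') == 'hello, world'
assert C().greet('world') == 'hello, world'
```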

I reject hacks like special-casing '@classmethod' and '@staticmethod'. I also don't think it would be a good idea to automagically decide whether something is supposed to be a class method, instance method, or static method from inspection of the body alone (as someone proposed in the comments on Bruce's proposal): this makes it harder to tell how it should be called from the 'def' heading alone.

In the comments I saw some pretty extreme proposals to save Bruce's proposal, but generally at the cost of making the rules harder to follow, or requiring deeper changes elsewhere to the language -- making it infinitely harder to accept the proposal as something we could do in Python 3.1. For 3.1, by the way, the rule will be once again that new features are only acceptable if they remain backwards compatible.

The one proposal that has something going for it (and which can trivially be made backwards compatible) is to simply accept
def self.foo(arg): ...

inside a class as syntactic sugar for
def foo(self, arg): ...

I see no reason with this proposal to make 'self' a reserved word or to require that the prefix name be exactly 'self'. It would be easy enough to allow this for class methods as well:
@classmethod
def cls.foo(arg): ...
Now, I'm not saying that I like this better than the status quo. But I like it a lot better than Bruce's proposal or the more extreme proposals brought up in the comments to his blog, and it has the great advantage that it is backward compatible, and can be evolved into a PEP with a reference implementation without too much effort. (I think Bruce would have found out the flaws in his own proposal if he had actually gone through the effort of writing a solid PEP for it or trying to implement it.)

I could go on more, but it's a nice sunny Sunday morning, and I have other plans... :-)

Wednesday, October 22, 2008

Sorting a million 32-bit integers in 2MB of RAM using Python

Someone jokingly asked me how I would sort a million 32-bit integers in 2 megabytes of RAM, using Python. Taking up the challenge, I learned something about buffered I/O.

Obviously this is a joke question -- the data alone would take up 4 megabytes, assuming binary encoding! But there's a possible interpretation: given a file containing a million 32-bit integers, how would you sort them with minimal memory usage? This would have to be some kind of merge sort, where small chunks of the data are sorted in memory and written to a temporary file, and then the temporary files are merged into the eventual output area.

Here is my solution. I'll annotate it in a minute.

NOTE: All my examples use Python 3.0. The main difference in this case is the use of file.buffer to access the binary stream underlying the text stream file.

#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)
On my Google desktop (a 3 year old PC running a Googlified Linux, rating about 34000 Python 3.0 pystones) this took about 5.4 seconds to run, with an input file containing exactly 1,000,000 32-bit random integers. That's not so bad, given that a straightforward in-memory sort of the same input took about 2 seconds:
#!/usr/bin/env python3.0
import sys, array
a = array.array('i', sys.stdin.buffer.read())
a = list(a)
a.sort()
a = array.array('i', a)
a.tofile(sys.stdout.buffer)
Back to the merge-sort solution. The first three lines are obvious:
#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4
The first line says we're using Python 3.0. The second line imports the modules we're going to need. The third line here makes it break on those 64-bit systems where the 'i' typecode doesn't represent a 32-bit int; I am making no attempts to write this code portably.

Then we define a little helper that is a generator which reads integers from a file and yields them one at a time:
def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x
This is where the performance tuning of the algorithm takes place: it reads up to 1000 integers (4000 bytes) at a time, and yields them one by one. I had originally written this without buffering -- it would just read 4 bytes from the file, convert them to an integer, and yield the result. But that ran about four times slower! Note that we can't use a.fromfile(f, 1000) because the fromfile() method complains bitterly when there aren't enough values in the file, and I want the code to adapt automatically to however many integers are in the file. (It turns out we write about 10,000 integers to a typical temp file.)
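A quick illustration of that fromfile() behavior: when fewer items remain than requested, it raises EOFError (though the items it did manage to read are still appended), which is why the helper above reads raw bytes instead:

```python
import array, tempfile

f = tempfile.TemporaryFile()
array.array('i', range(5)).tofile(f)  # write only 5 integers
f.seek(0)

a = array.array('i')
try:
    a.fromfile(f, 1000)  # ask for far more than the file holds
except EOFError:
    pass  # complains bitterly...
assert list(a) == [0, 1, 2, 3, 4]  # ...but the short read still landed in 'a'
```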

Next we have the input loop. This repeatedly reads a chunk of 10,000 integers from the input file, sorts them in memory, and writes them to a temporary file. We then add an iterator over that temporary file, using the above intsfromfile() function, to a list of iterators that we'll use in the subsequent merge phase.

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))
Note that for an input containing a million values, this creates 100 temporary files each containing 10,000 values.

Finally we merge all these files (each of which is sorted) together. The heapq module has a really nice function for this purpose: heapq.merge(iter1, iter2, ...) returns an iterator that yields the input values in order, assuming each input itself yields its values in order (as is the case here).

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)
This is another place where buffered I/O turned out to be essential: Writing each individual value to a file as soon as it is available slows the algorithm down about twofold. Thus, by simply adding input and output buffering, we gained a tenfold speed-up!

And that's the main moral of the story.

Another lesson is praise for the heapq module, which contains the iterator merge functionality needed in the output phase. Also, let's not forget the utility of the array module for managing binary data.
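In isolation, that merge functionality looks like this: given iterables that are each already sorted, heapq.merge() lazily yields all their values in a single sorted stream.

```python
import heapq

# Three already-sorted inputs, merged into one sorted output:
merged = list(heapq.merge([1, 4, 9], [2, 3, 10], [5]))
assert merged == [1, 2, 3, 4, 5, 9, 10]
```

Because it consumes its inputs one value at a time, only one value per input iterator needs to be in memory at once, which is exactly what the low-memory output phase requires.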

And finally, let this example remind you that Python 3.0 is not so different from Python 2.5!

Friday, October 10, 2008

About This Blog

You may wonder why I started a new blog, when I already have a perfectly fine blog on Artima. I am currently wondering too, since somehow my very first entry was flagged as spam by the helpful Blogger spam detection software, which now claims I have to request a manual review to prove I'm not a spammer. I'm guessing that either I used too many long, difficult words referring to schools of philosophy -- I've seen splogs containing long strings of text copied from philosophy textbooks -- or perhaps I quoted too many nonsense words from Anathem. Anyway, I've requested two reviews so far, but still nada. Fingers crossed.

As for the reason to start a new blog, the Artima blog is meant to be technical stuff about Python -- Artima in general is a site about OO programming, and I wanted to write about other stuff that interests me. I have a slew of topics lined up: other books I've read, more about my views on consciousness (with a lowercase 'c' :-), and a silly idea I call "planet of the nerds": what if Captain Kirk ran into a planet whose entire population was made up of nerds with IQs of 250 and no social skills? Would such a society even survive?

Anyway, this is just a test -- Blogger claims I cannot make new posts. Let's see if that's true!

Sunday, October 5, 2008

Thoughts after reading Neal Stephenson's Anathem

I just finished reading Neal Stephenson's new novel, Anathem. I don't know where to start: it's a fascinating story, if occasionally slow, but that seems unavoidable given the deep ideas Stephenson exposes. I'm not an avid Science Fiction reader, though I don't avoid the genre. There's just so much else to read about! (And I'm a slow reader, especially in English, which is not my first language.) But I like Stephenson's work, which I first encountered with Cryptonomicon. I ate up the Baroque Cycle, and when Stephenson showed up at Google a few weeks ago to discuss Anathem, I tuned in from the SF office. Hearing him talk about his ideas made me want to read the book even more -- but I decided to do some "homework" first, and bought Snow Crash in the same Amazon order as Anathem, and read it first.

(Warning: mild spoilers may be ahead.)

Snow Crash is fascinating, but it rings a bit hollow -- in the end I felt like it was mostly a superbly crafted fantastic adventure story. The "deep ideas" in that book felt a bit fake: the technical details of how the Metaverse works make little sense from what I know about virtual reality systems (this was corroborated by a friend who worked in the game industry for 25 years), and the idea of an "ur-language" that is like the low-level software of our brains strikes me as nonsense from what I've read about the functioning of our brains (see note above :-). The idea of a virus or anti-virus that can be fed into hackers' brains by appealing directly to this low-level software then becomes rather bizarre, and the idea of feeding this same information into non-hackers' brains via the blood is utter nonsense. So then we're left with a great, dark view of a possible future where many phenomena of modern society, from the mafia's code of honor to suburbs full of bored teenagers, have become fantastically exaggerated, and a lot of cool wish-it-could-be technology, from the Metaverse to skateboards and motorcycles with "smart wheels" (why doesn't the Wikipedia article mention those?). I find it hard to believe that there are many women who would enjoy this book; it's the ultimate testosterone fantasy.

Fast-forward to Anathem. Stephenson has gotten a lot better. It's more likely that women might appreciate this book (and not just because in the end the couple "get" each other and clearly will live happily ever after). As long as they're into philosophy and not afraid of long detailed explanations of space technology. Ok, so maybe it's still mostly a guy's thing (please prove me wrong!) but it's a lot less testosterone-laden, and the technology and "deep ideas" presented are a lot more realistic.

I find it interesting that in Snow Crash the (near) future has gotten completely out of hand, while in Anathem a world is depicted that develops more or less cyclically for thousands of years: after the high point of technology (roughly, today's age) things stay more or less the same, society and prosperity going down and up cyclically without much news being added. Stephenson's explanation for this is that most people prefer to deal with technology they can understand and tinker with, like internal combustion engines, rather than the more advanced space-age stuff that most Sci-Fi authors (including Stephenson, see Snow Crash) love to make up. Although nobody seems to object to the ubiquity of cell phones connected to the Internet -- they are apparently too useful (for society, or for the plot) to ban.

I loved the puzzles for the reader, small and large, that Stephenson spreads throughout the book. I recall that in Cryptonomicon he often starts a new chapter with a minute, detailed description of an everyday phenomenon from an unexpected vantage point or using precise scientific language, which often made me chuckle. There is some of that in Anathem, but the main class of puzzles is the introduction of new vocabulary (ostensibly because the world depicted is not Earth) for familiar things. E.g. ark for church, drummon, fetch and mobe for freight truck, pick-up truck and car, syntactic device or syndev for computer, praxis for engineering, Andrakhonic theorem for Pythagorean theorem, and Gardan’s Steelyard for Occam’s Razor. (I found one place where Stephenson apparently forgot to apply global substitution and used the word "computer-generated" instead of syndev-generated.) In general the new words are remarkably easy to remember once you've figured it out -- sometimes it helps to know Latin or French (for example fraa and suur are clearly from frère and soeur).

The only stuff that's hard is the names for the various philosophical schools -- probably because I'm not really all that well-schooled in this. I can recognize the references to Plato, but that's about it. There are probably tons of references to philosophic schools of the past two centuries, but I don't know my Nietzsche from my Kant, and I'm not all that well-versed in interpretations of quantum mechanics. Except I did recognize the idea of consciousness as a quantum computer as originating with Roger Penrose -- I read his book long ago, and decided I simply disagreed. The whole idea of merging together world lines in such a way that the history of one world appears in the present of another sounds like a Sci-Fi device of the same nature as travel faster than light: useful as a plot device, but not realistic.

Yes, of course it's fiction! But Stephenson really does seem to try to make everything else realistic -- the ideas he presents have roots in real science, and so does much of the technology, from the "long now" clock to the electrodynamic tether propulsion device (both mentioned in the acknowledgements). So I guess at this point we (Stephenson and I) simply disagree on the correct interpretation of quantum mechanics, which is not something to be particularly worried about, as it's all highly speculative in the real scientific world too.

When we're talking about the nature of consciousness, I'm definitely in the camp of Douglas Hofstadter, who makes (what I think is) an eloquent point that it's just a bunch of extremely sophisticated software running on highly developed hardware, both evolved over a very long time, starting with the earliest animals that were capable of making decisions based on sensory input. I recently bought and read Hofstadter's I am a Strange Loop and loved most of it, although I had to skip some of the sentimental stuff; it made me pull out Gödel, Escher, Bach and The Mind's I (both of which I'd read when I was about half my current age) and re-read much of them, enjoying them just as much as the first time (and probably understanding a lot more).

My only quibble with Hofstadter is his insistence on the importance of self-referentiality of consciousness. While I appreciate the relevance of introspection, and love the idea that the algorithms for searching our (various kinds of) memory are derived from the pattern matching algorithms originating in our sensory processing (which clearly was a precursor of consciousness), I don't believe that in order to be called "intelligent" a piece of software would necessarily have to have a concept of "I" that resembles that of a person. (And no, I don't think we're anywhere near developing such intelligent software yet -- we don't know enough about the most basic stuff like perception. For that matter, I think that the whole "Singularity" idea promoted by Ray Kurzweil is bulshytt, as the inhabitants of Stephenson's world would say.)

But I do agree with Hofstadter that there is no a priori reason why sufficiently complex (presumably layered) software couldn't be developed that has representations of real-world concepts that can be queried and updated and connected in ways very similar to the way this apparently happens in our brains. Personally, I think that one of the keys to the development of such software is a better understanding of our perceptual mechanisms. Which is a long-winded segue into the work of Oliver Sacks, whose works I've also been reading recently. (I am about to dive into Musicophilia, his latest work.) Sacks writes yarns that are as accessible as Stephenson's (or more so), and almost as exciting (for me anyway), with the added value that they are science, not fiction.

I wish I could work in a reference to Richard Dawkins, whose ideas on evolution also seem highly relevant. (Even though to a 3rd generation atheist like myself his rant against organized religion seems a little over the top.) But I'll save that for later.