Tuesday, January 6, 2009

The Linguists Doth Protest too Much...

I am wondering why the linguists are all in a snit over Paul Payack. Okay. They don't agree with him that there will be a million words (or that they're even countable, the way I understand it) by April, or whenever Payack's deadline is. It is quite clear that Payack is no linguist and not an expert on words. Therefore, so what? Forget about him. Don't buy his books, don't read the articles that are written about him, and don't write about him. We get Smithsonian, and I plan to write them about their stupid article on Payack. It was poorly researched and poorly written. Why, though, do the linguists keep the conversation going? Language Log is just as bad as anyone in keeping this going.

And, to be honest, why in the world do the linguists think it's impossible to count the number of words we have in English? You'd have to have a specific definition (for example, the number of words in the OED, or something similar), but I'd surely think you could. Of course the algorithm idea is ridiculous (it sounds a bit like an ex-poster of ours!), and as Dr. Benway says in that link above, Mr. Payack has no concept as to what an algorithm is. That alone should make people ignore Payack and get on with real linguistic discussions. However, the more the linguists protest, the more we will hear about this ridiculous "algorithm."

I do wonder, as did arnie on Wordcraft, whether Payack actually graduated from Harvard. If so, with what degree?

11 comments:

goofy said...

I think one reason LL keeps talking about it is they want to correct misunderstandings about language. Journalists might find the LL post and consider it before they write about the GLM.

You probably could count all the words in English if you had a very rigorous definition of "word", but any such definition would be arbitrary, no two people would agree on it, and so it would render the whole exercise meaningless. Altho it's meaningless already.

Kalleh said...

I hear what you are saying, goofy, about Payack. However, it's like parents who make a big deal when their kids act out. Ignore them, and it will stop. Make a big deal of it, and they will continue to annoy.

As for the counting of words, I do think that could be done in a meaningful way, but then who am I to know. However, we put an awful lot of money into studying far less important things.

seanahan said...

It actually shouldn't be that difficult at all to count the number of words in the English language by simple computational methods. Take, for example, all the articles in the New York Times for a year and count the number of unique words, adjusting for morphological derivations. One can judge the number of words by a variety of statistical methods, including Good-Turing. Also, English word frequencies tend to fit pretty well with Zipf's law.

Of course, new words are being coined constantly, and old words being dropped, so one would have to adjust for words which are only ever used once, or words so rare that even when used they are not understood, but that doesn't mean a good effort can't be made.
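For readers who want to see what that kind of count looks like in practice, here is a minimal Python sketch. It is only an illustration under stated assumptions: the tokenizer, the function names (`tokenize`, `vocabulary_stats`) and the sample sentence are made up for this example, it makes no adjustment for morphological variants, and it uses only the simplest Good-Turing result, namely that the share of running text made up of not-yet-seen words is roughly the number of words seen exactly once divided by the total number of tokens.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters (with an optional
    internal apostrophe, as in "don't"). This is a deliberately crude,
    hypothetical definition of "word"."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def vocabulary_stats(text):
    """Count tokens, distinct words, and words seen only once, and apply
    the basic Good-Turing estimate of unseen probability mass (N1 / N)."""
    counts = Counter(tokenize(text))
    total_tokens = sum(counts.values())
    distinct_words = len(counts)
    seen_once = sum(1 for c in counts.values() if c == 1)
    # Good-Turing: the chance that the next token is a word we have not
    # seen yet is estimated by (words seen exactly once) / (all tokens).
    unseen_mass = seen_once / total_tokens if total_tokens else 0.0
    return {
        "tokens": total_tokens,
        "distinct": distinct_words,
        "seen_once": seen_once,
        "unseen_mass": unseen_mass,
    }


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat too"
    print(vocabulary_stats(sample))
```

On a real corpus, a large unseen mass would suggest that the distinct-word count is still far from levelling off, which is one way of putting a number on the "new words are being coined constantly" caveat.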

goofy said...

seanahan, I really don't think it's that easy...

http://www.slate.com/id/2139611

seanahan said...

I mean, I've read quite a bit into this topic, and one of the main issues is that we have to define "word", which is not an easy thing. However, we can come up with a simple definition, or even do something like create a parameterized definition (i.e., occurs N times in last M years), and then run the code on the New York Times, blogs, or various combinations thereof, and then have a range of numbers for the size of the English vocabulary.
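The parameterized definition could be sketched roughly as follows. This is a hypothetical illustration, not anyone's actual method: the corpus is assumed to be a list of (year, text) pairs, and the names `count_words` and `vocabulary_range`, the tokenizer, and the tiny sample corpus are all invented for the example. A "word" here is any token that occurs at least N times in documents from the last M years.

```python
import re
from collections import Counter


def count_words(documents, current_year, min_occurrences, window_years):
    """Count distinct tokens that occur at least `min_occurrences` times
    in documents dated within the last `window_years` years.

    `documents` is assumed to be an iterable of (year, text) pairs.
    """
    cutoff = current_year - window_years
    counts = Counter()
    for year, text in documents:
        if year > cutoff:  # keep only documents inside the window
            counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return sum(1 for c in counts.values() if c >= min_occurrences)


def vocabulary_range(documents, current_year, thresholds, windows):
    """Run the count for every (N, M) combination, giving a range of
    vocabulary sizes rather than a single number."""
    return {
        (n, m): count_words(documents, current_year, n, m)
        for n in thresholds
        for m in windows
    }


if __name__ == "__main__":
    corpus = [
        (2005, "blog blog gaol"),
        (2008, "blog jail jail"),
        (2009, "jail podcast"),
    ]
    sizes = vocabulary_range(corpus, 2009, thresholds=[1, 2], windows=[1, 2, 5])
    for (n, m), size in sorted(sizes.items()):
        print(f"N={n}, M={m}: {size} words")
```

Even on this toy corpus the answer swings from 0 to 4 depending on N and M, so the output is best read as a range tied to the chosen definition rather than as the size of the English vocabulary.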

goofy said...

I suppose you could do that, but what would be the point? It would measure the number of "words" in the NY Times, blogs, or various combinations thereof, but I'm not convinced that it would actually measure the size of the English vocabulary.

seanahan said...

Is there a better proposal you have in mind? My method is to measure the number of words in written English, which is not necessarily what might be wanted. If we wanted to measure the number of words in spoken English, we could do something like count the words used in newscasts, or in telephone conversations, or in some other variety of easily recordable speech. I think written and spoken English will have differences, but I don't know that they're significant.

goofy said...

As I see it, the problem is that even if you can agree on what a "word" is (is run, ran, running 3 words or 1, and do the noun running and the verb running count as 2 words or 1?), then you have to agree on what "English" is. Sheidlower gives some examples. Is "jail" the same word as "gaol"? What about regional variants, archaic words, and of course numbers? You could count numbers forever. And then you have World Englishes: Hinglish, Singlish etc. Do they count? The idea of counting words in a language seems pointless.

seanahan said...

I guess if you think the whole thing is pointless then it doesn't really matter what I propose. I don't think it's entirely meaningless. Measuring the growth of vocabulary of languages over time could have interesting implications linguistically. Alternatively, if you want a machine to pass the "Turing Test", it needs to understand English, and thus needs to have some idea of what words are in English.

goofy said...

I think it's pointless because I don't think people can agree on definitions of "word" and "English" for long enough to count them all. For instance, you think that you can count all the words in English by analyzing newspapers and blogs, and I disagree.

And, even if you could do it, I'm not sure what it would achieve. Programming a machine to use English for the Turing test doesn't require counting all the words in English; it just requires teaching the machine the vocabulary it needs.

Kalleh said...

Thank you so much, goofy and sean, for your thoughts.

Given an operational definition of a word, I do think we could accurately count English words. But, as goofy says, what would be the point because not everyone would agree with that definition?