Or, Why it's harder than you might think to teach a computer Pig Latin
One of the tasks that falls to anyone teaching an introductory course in linguistics and/or phonetics is that of disabusing the students of the notion that vowels and consonants are types of letters. Vowels and consonants are types of sounds; letters are marks on the page (or on the computer screen, or on the lavatory wall, etc., etc.). Letters typically stand for sounds, though not always—for example, the letter P stands for 'copyright' on an audio recording, for Northern Ontario at the beginning of a Canadian postal code, and for nothing whatsoever in the word ptarmigan. So we have to explain to the students that, rather than saying that a letter such as y is "sometimes a vowel," it is more accurate to say that y in English orthography is used to represent various sounds, some of which are vowels and some of which are consonants. And we teach them to use sets of phonetic symbols, such as the International Phonetic Alphabet, that are designed for representing specific speech sounds without the ambiguity that so often arises in the orthography of many languages, especially English. So, the letter y represents the diphthong /aj/ in cry, the vowel /ɪ/ in crypt, the vowel /i/ (or /ɪ/ in some dialects) in city, and the consonant /j/ in yes.
Why make such a big deal about this distinction? Well, in natural language, sounds matter a great deal. For example, the indefinite article in English has two different forms, a and an, and their distribution depends on whether the sound (not the letter) immediately after the article is a consonant or a vowel. Speakers of English say not only an owl and a youth, but also an hour and a union, and if linguists want to make any sense of what is going on, we must think in terms of sounds, not spellings.
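To make the contrast concrete, here is a minimal sketch in Python of the a/an choice made two ways: by the first letter of the spelling, and by the first sound of a transcription. The tiny `TRANSCRIPTIONS` table and both function names are assumptions made up for this illustration, and the transcriptions are rough broad ones, not taken from any real pronouncing dictionary.

```python
# Hypothetical mini-lexicon: rough broad transcriptions of the four example words.
TRANSCRIPTIONS = {
    "owl": "awl",
    "youth": "juθ",
    "hour": "awɹ",
    "union": "junjən",
}

LETTER_VOWELS = set("aeiou")            # the "vowels are letters" assumption
SOUND_VOWELS = set("aeiouɪɛəæɑɔʊʌ")     # IPA symbols treated as vowel sounds here


def article_by_letter(word):
    """Pick a/an from the first letter of the spelling."""
    return "an" if word[0].lower() in LETTER_VOWELS else "a"


def article_by_sound(word):
    """Pick a/an from the first sound of the (assumed) transcription."""
    return "an" if TRANSCRIPTIONS[word.lower()][0] in SOUND_VOWELS else "a"


for w in TRANSCRIPTIONS:
    print(w, article_by_letter(w), article_by_sound(w))
```

The two functions agree on owl and youth, and disagree on hour and union, where only the sound-based one matches what speakers actually say.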
Another place where we can see that it is sounds that matter is in language games, such as Pig Latin, that involve altering the forms of words in various systematic ways. The basic rule of Pig Latin is that you take the initial consonant or consonant cluster from the beginning of a word, move it to the end, and add -ay (IPA /ej/). As has been explored systematically in recent work by people such as Jessica Barlow, Bert Vaux, Andrew Nevins, Bill Idsardi, and Eric Raimy, there's some variation in exactly what people do when the word begins with more than one consonant (e.g., street in my dialect of Pig Latin is eetstray, but some people say treetsay, reetstay, or even reetstray), or when there is no consonant at the beginning. There's also variation when spelling gets in the way, and especially with sounds like the /j/ at the beginning of union. (My Pig Latinization of this is /unjənjej/ in IPA; I'm not sure how I'd write it in regular orthography, perhaps oonionyay, but I believe some people would say /junjənej/ (unionay) or /junjənwej/ (unionway).)
Despite the variation and the possibility of orthographic interference, there are many cases where human speakers' intuitions are straightforwardly based on sounds, but where a simplistic computer program based on the assumption that vowels and consonants are types of letters will go quite ridiculously astray. I was reminded of this by a recent post by P. Z. Myers, who reports on a site called "Is this your name?"; the site provides various facts and factoids about one's first and last names, and gives a "guesstimate," based on (oldish) census data, of how many people with that combination of names there are in the United States. Myers, because he gave his first name as PZ, was unsurprisingly predicted to be unique. (He was also characterized as "poorly envoweled.") What caught my eye, though, was the translation of the name into Pig Latin: Pzay Ersmyay.
Leaving aside the inherent difficulty in trying to process Pz as a name rather than a pair of initials, I don't think any human being who knows how the name Myers is pronounced would come up with Ersmyay as a Pig Latinization of it. The trouble is that the program seems to be based on the assumption that y always represents a consonant, so it treats the my sequence at the beginning of Myers as a consonant cluster, and moves the whole thing to the end of the word, even though the y here actually represents the diphthong /aj/. Myers in Pig Latin is /ajəɹzmej/; orthographically, I guess it'd be Yersmay, although the spelling doesn't really make the pronunciation clear at all. Human speakers differ about what to do with y in Pig Latin when it represents the glide /j/ (it's not for nothing that glides are also called "semivowels"!), but when it represents /aj/, there's no question that the sound is a vowel, not a consonant.
Not surprisingly, "Is this your name?" also produces seriously bad output for names containing vowels not represented by any letter, such as surnames beginning with the prefix Mc- (pronounced /mək/ or /mɪk/). Just for fun, I had it process the name Myra McClelland, and what it came up with was Amyray Ellandmcclay. Again, I'm not entirely sure how I would spell this name in Pig Latin (though it certainly wouldn't be like that!), but I would pronounce it /ajɹəmej əklɛləndmej/. (Maybe you could write it Yramay cClellandmay?)
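I don't know how the site is actually implemented, but a purely letter-based rule that counts only a, e, i, o, and u as vowels would produce all of the outputs quoted above. The following Python sketch is my own guess at such a rule; the function name `naive_pig_latin` is made up for this illustration.

```python
LETTER_VOWELS = set("aeiou")  # assumption: y is always treated as a consonant


def naive_pig_latin(word):
    """Move the longest initial run of consonant LETTERS to the end, add 'ay'."""
    lower = word.lower()
    i = 0
    # Skip over every leading letter that is not in the vowel-letter set.
    while i < len(lower) and lower[i] not in LETTER_VOWELS:
        i += 1
    result = lower[i:] + lower[:i] + "ay"
    # Restore an initial capital if the input had one.
    return result.capitalize() if word[:1].isupper() else result


for name in ["Pz", "Myers", "Myra", "McClelland"]:
    print(name, "->", naive_pig_latin(name))
```

Nothing in this rule ever looks at pronunciation, which is why the y in Myers and Myra and the unwritten vowel in Mc- both trip it up.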
If you want to write a program to translate written English into Pig Latin, then you can't get away with just saying, "These letters are consonants; these other letters are vowels; take the largest possible string of consonant-letters from the beginning of the word, move it to the end, and add ay." You actually need to know something about English pronunciation. Some words, perhaps, can just be listed—so that the program knows, for example, that uniform begins with a /j/ and uninformed doesn't. But if you're working with proper names, then you probably need to set your program up to make intelligent guesses about words it's never seen before. Phonotactic knowledge will help you here—for instance, you'll have a better chance of getting Myers right if you know that the sequence /mje/ cannot occur at the beginning of an English syllable. You almost certainly won't be right all the time, but at least you should be able to avoid Amyray Ellandmcclay.
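As a rough illustration of the "just list some words" option, here is a Python sketch that works on sounds rather than letters. Everything in it is an assumption for the sake of the example: the hand-written `LEXICON` of phoneme lists, the `VOWEL_SOUNDS` set, and the function name `pig_latin_ipa`. It follows my own dialect of Pig Latin (the whole initial cluster moves), outputs IPA rather than a spelling, and simply gives up on words it doesn't know; the intelligent guessing for unseen names is the hard part and is not attempted here.

```python
# Hypothetical pronunciation lexicon: each word is a list of phonemes
# in a rough broad transcription.
LEXICON = {
    "myers": ["m", "aj", "ə", "ɹ", "z"],
    "myra": ["m", "aj", "ɹ", "ə"],
    "mcclelland": ["m", "ə", "k", "l", "ɛ", "l", "ə", "n", "d"],
    "street": ["s", "t", "ɹ", "i", "t"],
    "union": ["j", "u", "n", "j", "ə", "n"],
}

# Phonemes counted as vowels (including diphthongs) in this sketch.
VOWEL_SOUNDS = {"a", "aj", "aw", "ej", "i", "ɪ", "ɛ", "ə", "u", "o"}


def pig_latin_ipa(word):
    """Return an IPA Pig Latin form, or None if the word is not in the lexicon."""
    phonemes = LEXICON.get(word.lower())
    if phonemes is None:
        return None  # unknown word: guessing its pronunciation is out of scope
    i = 0
    # Find the first vowel SOUND; everything before it is the onset.
    while i < len(phonemes) and phonemes[i] not in VOWEL_SOUNDS:
        i += 1
    return "".join(phonemes[i:] + phonemes[:i] + ["ej"])


for name in ["Myers", "Myra", "McClelland", "street", "union"]:
    print(name, "->", pig_latin_ipa(name))
```

Because the onset is defined over phonemes, the /aj/ in Myers stays put and only the /m/ moves, and the unwritten vowel in Mc- is visible to the rule.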
So remember, kids, whether you want to pass LIN 100 or serve up amusing trivia to people on the Internet, your motto should be "Sounds, not letters"!