What Americans Need to Know About Chinese “Tones”

March 11th, 2009

I always had trouble with the spoken language when I first started learning Mandarin. A friend in my dorm, Ann, gave me the most exasperated looks when she offered to help, then found herself helplessly lost in an endless loop with me.

Her: “Now, repeat after me. Bu.”

Me: “BOO.”

Her: “Bu.”

Me: “BOOooU?”

Her: “Bu!”

Me: “Hmm.”

Obviously, something wasn’t translating. In a normal Chinese class these days, you’ll be taught that there are four or five tones that you can use when pronouncing a syllable. Your teacher will most likely speak verrry slowwwly and overemphasize his or her pronunciation, with sharp changes in pitch.

When I was a student, my professor, Zhuang lao-shi, taught us through this method. We repeated after her: “BOO”, “booOOO”, “BooOO”, “BOoo”, singing awkwardly and uncomfortably through the lesson. To Americans, the word “tone” makes us think of tonal pitch, and the do-re-mi-fa-so-la-ti-do training we endured as children – so, when I was hearing this stuff, all I could tell was that each tone was supposed to change in pitch. First tone – high pitch. Second tone – low to high. Third tone – kinda starting in the middle, getting low, then going back up. Fourth tone – starts high, then ends low. Fifth tone – well, let’s just say that fifth tone was inscrutable, as it’s supposed to be “toneless.” But, if you’re anything like me, you might have wondered how a spoken sound can have no pitch at all, the same way a young chemistry student might wonder how a solid could have a pH reading if you can’t dip your pH paper into it.

So, in spite of my confusion, I continued onwards, speaking slowly and in a way that caused Chinese people to ask why I didn’t speak with my real voice. I got through three years of Chinese training in college, but unfortunately Chinese departments are so deathly afraid of losing all their students that they’ll give you A’s if you can just read and write. I sounded nasal and weird, and wasn’t precisely sure what was going on that was wrong, and left with the real misconception that Chinese have some unnatural ear for pitch that Americans don’t.

Then, years later, in the middle of a board game in which I was fumbling through my Chinese, I heard someone pronounce a syllable in an interesting way. “HuuuUUUUU,” went the word. It’s what you say when you switch out a piece in Mah Jong, and in this context it was spoken as a monotone that increased slowly and confidently in intensity, ending with an emphasis. It was like a parent giving a stern warning, leaning on the second syllable of a name to leave an ominous note hanging, i.e. “gordoNNNN… don’t you dare touch that cookie!”

It gave me pause, because the slowness of the pronunciation made it obvious to me that it was the fabled Second Tone… but the monotone threw me for a loop. If it could be that the pitch could be held constant, but the emphasis could change within a single syllable, was that what I was missing out on all those years in class?

Could it be that the word “tone” just struck an immediate, lasting, and fundamentally incorrect mental assumption about spoken Chinese that let me block out other variations in the pronunciation of words, such as emphasis and intensity?

Experience implied that an inconsistency so striking was worth a look. So, for several months, I attempted to change the way I listened to Chinese, listening for emphasis instead of pitch. Slowly, the importance of emphasis began to unfold before me, and I realized that steady strong emphasis was the telltale sign of first tone. An emphasis that slowly built like a crescendo was a sign of second. Fourth tone, with its sharp initial emphasis and quick drop-off, became a completely different beast to me – pronounced in this way, it sounds very harsh to Western ears, and can make the speaker sound agitated or angry in excited contexts. Third was like fourth, with a quick return to emphasis that hangs in the air, more enunciated the slower the word is pronounced. Fifth tone finally made sense – the lack of any emphasis at all, or a “soft” word.

Word after word, phrase after phrase, I started hearing things in a completely different light.

“BAA-ba”, pronounced with a constant strong emphasis on the first syllable, followed by a soft or no-emphasis second syllable, is the way you say “father,” so you can imagine the difference between that, and nasally pronouncing the first syllable with a high pitch and then searching for an inexplicable no-pitch sound in the second.

Changing the way I understood tones made it much easier to listen to the flow of spoken Chinese. So much of real-world Chinese is relatively pitch-less but emphasized very dramatically, that it can be completely overwhelming and foreign to Americans who mistook tones for being only pitches.

In reality, much of American spoken English contains within it embedded meaning based on the emphasis within words and within sentences. I doubt that this is commonly taught to foreign speakers of English, but it’s there nonetheless. Americans definitely can understand how the emphasis put on a word can change its meaning significantly. Not only that, but there are far more than five tones in English (how do you classify “Shiiieeeet”?)! So why don’t Americans pick up on this more quickly? I have a theory.

The first reason is that horrible translation of “Si Sheng”, “Four Tones”. Tone connotes pitch too strongly in English, and I fear that it sends people down the wrong cognitive path, as it did for me. Secondly, Chinese are so certain that Americans just can’t speak Chinese, that they teach it by speaking extremely slowly, carefully enunciating with wide variations in pitch. When we attempt to mimic that, I believe that our brains pick up on the pitch changes first, and then as the words speed up, the teachers move to naturally using emphasis while the students are left stumbling over pitch changes that are completely foreign to Americans. Many completely give up, and just end up speaking all words as emphasis-less monotone, expecting all native Chinese speakers to be as patient and encouraging as their poor teachers.

The third one is a bit more confrontational to discuss. If you take a look at the Wikipedia entry on pinyin, the generally-accepted Romanization of Chinese words, you’ll see that it explicitly states that tones are changes in pitch. The graph you see on the right-hand side is pretty much the same thing that was given to me in handout form when I started learning. It’s the establishment way of teaching people, but it’s not enough. Sure, if you tune into a government official speaking to the public, they will sound almost musical in nature, shifting pitch in a slow, deliberate manner, and in this way, they express the official-ness of their words. But, if you pay attention, you’ll notice what’s present in every bit of conversational Mandarin – the sharp contrasts of emphasis that stand out with every syllable.

How would you fix it, and make it easier for Americans to learn Chinese? I’d definitely just abandon the concept of tone-as-pitch. Just throw it overboard. It’s too confusing for Americans to focus on pitch, and I think we’d learn fewer bad habits if we were given some mental symbolism that didn’t send us down the wrong path. Perhaps emphasis is the right word, perhaps not. At the least, an example contrasting the pronunciation of a long, drawn-out second tone with the quick, spitting emphasis of fourth tone could go a long way towards demonstrating the difference. Having both male and female teachers during lessons about spoken Chinese might also help us get rid of our habit of attempting to mimic pitch only. In addition, learning several common Chinese sayings and then breaking them down might help. Once I learned how to pronounce “Gong Xi Fa Cai” (“Happy Chinese New Year”) perfectly from a friend, it was easy – but if I had started by reading the pinyin out loud, I’d have sounded quite ridiculous.

Nowadays, I’ve improved slightly in pronunciation and a great deal in comprehension, but unfortunately am past the days when I have a ton of time to practice. In reality, there’s pitch, emphasis, and more going on inside the rhythm and meter of everyday Mandarin, and even knowing what I know now, it’s still a difficult mountain to climb. If I could have had this revelation earlier, I think it could have helped out, but it’s my hope that challenging the status quo on this one might do some good even if I’m hopeless. :)

Fun

PHP $_POST array empty although php://input and raw post data is available?

February 24th, 2009

I just helped a friend get through a devious issue with his PHP installation that I thought I’d blog about. There are other posts discussing problems where the $_POST array is completely empty even though reading POST data directly via the php://input wrapper works.

I used a similar test file to try and determine what was happening in the first place:



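<?php
// Diagnostic only: compare the raw request body with what PHP parsed into $_POST.
// CONTENT_TYPE is not set on a plain GET, so guard against the undefined index.
$contentType = isset($_SERVER['CONTENT_TYPE']) ? $_SERVER['CONTENT_TYPE'] : '(not set)';
print "CONTENT_TYPE: " . $contentType . "<BR />";
$data = file_get_contents('php://input');
print "DATA: <pre>";
var_dump($data);   // raw body as sent by the browser
var_dump($_POST);  // what PHP parsed from that body
print "</pre>";
?>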
<form method="post">

    <input type="text" name="name" value="ok" />
    <input type="submit" name="submit" value="submit" />

</form>

If you submit the form and you see this as the result:



CONTENT_TYPE: application/x-www-form-urlencoded
DATA:

string(21) "name=ok&submit=submit"
array(0) {
}

… then that means that your browser is correctly submitting the right CONTENT_TYPE, and is also sending the browser POST data correctly. PHP is also seeing the right raw post data via the raw POST input wrapper, but the $_POST superglobal is totally empty.

What this eventually got tracked down to was an update in the php.ini that changed post_max_size from “8M” to “10MB”. Did you catch that? The use of “MB” instead of “M” was invalid, but instead of throwing an error on startup, PHP silently fell back to a limit that was effectively zero (it appears to ignore the unrecognized suffix and read the value as a handful of bytes). Since post_max_size was far smaller than the request body, nothing made it into the $_POST array.
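A check along these lines might catch this kind of typo earlier. It is only a minimal sketch: is_valid_ini_size() and ini_size_to_bytes() are hypothetical helper names of my own, and it assumes the only valid shorthand is a whole number optionally followed by K, M, or G.

```php
<?php
// Hypothetical helper: true when a php.ini size value is a whole number,
// optionally followed by a single K, M, or G (upper or lower case).
function is_valid_ini_size($value)
{
    return preg_match('/^-?\d+[KMG]?$/i', trim($value)) === 1;
}

// Hypothetical helper: converts valid shorthand to bytes, or returns null
// when the value is not valid shorthand (for example "10MB").
function ini_size_to_bytes($value)
{
    if (!is_valid_ini_size($value)) {
        return null;
    }
    $value = trim($value);
    $number = (int) $value;                      // leading digits only
    $suffix = strtoupper(substr($value, -1));    // last character
    switch ($suffix) {
        case 'G': return $number * 1024 * 1024 * 1024;
        case 'M': return $number * 1024 * 1024;
        case 'K': return $number * 1024;
        default:  return $number;
    }
}

// Report suspicious size settings in the running configuration.
foreach (array('post_max_size', 'upload_max_filesize', 'memory_limit') as $name) {
    $raw = ini_get($name);
    if ($raw !== false && $raw !== '' && !is_valid_ini_size($raw)) {
        print "$name has a suspicious value: $raw\n";
    }
}
```

Dropping something like this next to the test file above would have pointed at “10MB” right away, without having to guess at how PHP interpreted it.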

Hope this helps anyone else debugging a similar issue! Oh, and hopefully it goes without saying, but make sure you delete this test file as soon as you’re done, because it’s a giant XSS hole waiting to happen.

Tech

Gonna play with RRDs soon.

February 4th, 2009

It just occurred to me that RRDs might be really well-suited to some general purpose roles as counters, statistics trackers, etc. for use in webapps. There are statistical analysis tools built in, so I could potentially see it as a way to (locally) rate limit API requests, track certain types of errors, etc. It’s also very fast, writes back to disk, and already plays very nicely with a variety of graphing and monitoring tools.
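To make the idea concrete, here is a rough in-memory sketch of the round-robin principle applied to rate limiting. It is not the RRDtool API and it does not write to disk; RoundRobinCounter and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example easy to follow.

```php
<?php
// Hypothetical class: a fixed-size ring of per-interval counters, loosely
// modeled on the round-robin idea. This is NOT the RRDtool API.
class RoundRobinCounter
{
    private $step;    // seconds covered by each slot
    private $slots;   // number of slots in the ring
    private $counts;  // slot index => event count
    private $stamps;  // slot index => interval number last written there

    public function __construct($step, $slots)
    {
        $this->step = $step;
        $this->slots = $slots;
        $this->counts = array_fill(0, $slots, 0);
        $this->stamps = array_fill(0, $slots, -1);
    }

    // Record one event at the given Unix timestamp.
    public function hit($now)
    {
        $interval = (int) floor($now / $this->step);
        $index = $interval % $this->slots;
        if ($this->stamps[$index] !== $interval) {
            // The slot still holds an older interval: overwrite it.
            $this->stamps[$index] = $interval;
            $this->counts[$index] = 0;
        }
        $this->counts[$index]++;
    }

    // Total events across the intervals still inside the ring's window.
    public function total($now)
    {
        $current = (int) floor($now / $this->step);
        $oldest = $current - $this->slots + 1;
        $sum = 0;
        foreach ($this->stamps as $index => $interval) {
            if ($interval >= $oldest && $interval <= $current) {
                $sum += $this->counts[$index];
            }
        }
        return $sum;
    }
}

// Usage: allow at most 3 requests per 60 seconds (6 slots of 10 seconds).
$limiter = new RoundRobinCounter(10, 6);
$limiter->hit(1000);
$limiter->hit(1005);
$limiter->hit(1030);
$allowed = $limiter->total(1030) < 3;  // the limit has been reached here
```

The storage never grows, because old intervals are simply overwritten as the ring wraps around, which is the property that makes the RRD approach attractive for counters.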

I’ll do some experimentation soon and report back.

Tech

Steer Mouse and the Middle Click button

January 30th, 2009

I was shopping online for a replacement for my Bluetooth wireless mouse for my Mac, and I read that there was an alternative mouse driver that might let me tweak the acceleration settings to something more acceptable.

I ended up downloading and trying Steer Mouse, and it’s actually pretty good at doing that. You can bring down the “Tracking” and bump up the “Sensitivity”, and it won’t be as wonky. Qualitatively, there’s also less noticeable lag between mouse movement and actual input (something that’s been driving me crazy as well). It may hold up better in heavy application use as well; time will tell.

However, one thing I wasn’t willing to live with was the default remapping of middle click to “Move to Close Button”. To make it open new tabs in Firefox again, you can remap it to the Click action with an Apple-key modifier. That seems to work. I’ll let it go through the trial period and see if I still like it after some more use.

Tech

Costs of Microformats, cont.

January 13th, 2009

My earlier post about the costs of microformats led to some interesting comments that I’d like to respond to at the top level here.

In this comment, André Luís brings up his personal experience with ease of publishing hAtom, and also points out that it may have just been my publishing platform which gave me difficulty. He also mentions that I’ve overlooked added value for power users.

Shira believes that many of the problems I listed will go away with maturity of the microformats project, and are present because of its work-in-progress status.

Yes, it’s true that there will be some cases where publishing microformats will be a technically straightforward matter (especially with, say, hAtom or hCard). I don’t wish to say that it’s always technically difficult, but what I’d really love to get across is that there are not only potential technical landmines, but also policy and relationship complications that can result from such a move. In the case of my publishing platform, which was code written in-house, there were extreme difficulties in attempting to contort our markup into something that would follow microformat structure expectations in hCalendar and hCard. Additionally, publishing microformats resulted in ongoing maintenance, policy, and backwards-compatibility costs that were not considered at the get-go. No matter what method you use to reach out to developers or power users, you will need to pay the price to keep those relationships healthy, and your data strong. Microformats don’t make your job any easier, because you’ve just intermingled the methods you use to serve your general user base and your developer base. Haphazardly publishing microformats while keeping your developer API users happy is going to cause grief for some users, and will likely be interpreted as sloppy behavior.

I certainly have avoided discussing any of the potential value of publishing microformats, as the title of my previous post may let on. I find that nearly all of the information available solely discusses the benefits without spending due time on the costs. Furthermore, the benefits of microformats (as well as the potentially reduced technical costs that Shira refers to) are, from a pragmatic standpoint, hypothetical. As a writer, it’s my hope that by spreading more complete information, people can make more educated decisions, and I feel that what’s missing in the hype about microformats is the discussion of costs. So, I apologize if I don’t spend much time on the benefits, but I really feel you can get more of that information elsewhere, as long as you remember that those benefits are typically projected onto a vision of the future where microformats are omnipresent. I’ll just say that I don’t see many benefits that aren’t already available to publishers who conform to the existing standards that microformats are based on.

So, what is the likelihood of a world in which microformats are an accepted standard in XHTML publishing? What are the chances that your real, tangible costs will eventually be paid back by real, tangible benefits? How would you begin to evaluate or consider that likelihood?

Typically, standards succeed when there is a massive market need that has not yet been met. However, there already exist viable, well-known methods of serving the needs of both browser users and developers, so I don’t see why these large publishers should invest in serving the “power users” market in between those two, when power users’ needs are so close to those of developers. The power users are often served by the standard data formats underlying microformats anyway. Also, the developers often take developer platforms and build products specifically targeted towards power users! So, I don’t see a large, unmet need. To me, it just seems like a product aimed at serving a niche user base that is already being served pretty well.

Also, there is the fundamental matter of data quality that drives the growth of a standard. If your data is questionable, nobody will want it. Or, they’ll collect it and throw it away. If your data is valuable, consumers will go to great lengths to take in and use as much as your policy allows. This natural magnetism that high-value publishers have predates microformats, and explains well why there are dominant platform APIs that don’t seem to play by the rules of standardization. Sure, there is a high technical barrier to entry, but typically, these producers invest heavily in data quality, policy, AND their developer relationships. Relationships between publishers and consumers around valuable data are what drive your average successful Web 2.0 platform, not whether the publishing format is standardized. So, if you want to really predict adoption of microformats, pay attention to the data quality of published microformat data.

If you want to consider the potential of momentum to bring about a microformat-based future, you’ll pay attention to the spread of adoption of microformats, and the hype surrounding them. As mentioned above, I would encourage you to also frame your perspective according to the quality of the data made available via microformats, but that requires a lot of time and energy to really evaluate. The vision here is pretty blurry; I think that’s just the nature of hype. It’s easy to get excited about something before the costs come into the picture.

Essentially, by publishing microformats now, you are becoming an early adopter. These standards are not mature, which is essentially what Shira is pointing out. The costs may one day be lower, but by getting into the game now, you are not only placing a bet that many others will follow, but also subsidizing the costs of publishing for latecomers. I don’t think that the current tangible costs are outweighed by the current tangible benefits plus the hypothetical benefits multiplied by the likelihood of them coming to pass. :) But don’t take my word for it. I’d just encourage you to at least make an educated decision, and consider all of the costs I listed in my previous post before taking the plunge.

Tech

A Warning About the Real Cost of Microformats

January 8th, 2009

I’m done with microformats. From now on, I’m either building separate developer tools and relationships, or I’m not. I say that having been through the cycle of adopting hCalendar and hCard a few times, not just as an industry commentator. My reasons are threefold. First, in the real world, publishing microformats requires you to rewrite DOM structures, publish extraneous invisible elements, adopt new schemas, and adopt data publishing-like structures on frontend pages intended for the browser. Second, the relationship between publisher and developer is not significantly improved by microformats, and would be better served by a separate pathway for developers. Third, their proclaimed benefits as a standard are extraneous because hCard, hCal, etc. are valiantly attempted but incomplete redefinitions of existing standards like vCard and iCalendar in XHTML-like formats.

The idea of microformats sounds swell, inexpensive, and easy. Take your existing data, and surround the data-ish bits with tags that separate it into parts with semantic meaning. Unfortunately, saying that something is easy does not make it so. I don’t mean to disparage the work that microformats mavens have done, but my experience with being a microformat publisher has shown that things are exponentially more complex than they let on in the “sales pitch.” I don’t think they realized what they were getting into when they started on the process of actually getting publishers to conform to this stuff.
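For reference, the “sales pitch” version looks roughly like the sketch below. It is a minimal illustration, not code from my platform: render_hcard() and the shape of the $contact array are hypothetical, and only the class names (vcard, fn, org, tel) come from hCard itself.

```php
<?php
// Hypothetical helper: wraps contact fields in hCard class names.
// Every value is HTML-escaped before it is placed in the markup.
function render_hcard(array $contact)
{
    $e = function ($text) {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    };

    $html = '<div class="vcard">';
    $html .= '<span class="fn">' . $e($contact['name']) . '</span>';
    if (!empty($contact['org'])) {
        $html .= ' <span class="org">' . $e($contact['org']) . '</span>';
    }
    if (!empty($contact['tel'])) {
        $html .= ' <span class="tel">' . $e($contact['tel']) . '</span>';
    }
    $html .= '</div>';
    return $html;
}

print render_hcard(array('name' => 'Ann Example', 'org' => 'Example Co')) . "\n";
```

This only works so neatly because all of the fields already live side by side in one flat record, which is rarely how a real page is put together.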

Surrounding your existing markup sounds simple enough, right? But consider how you’ll have to nest things together. Should you publish microformats on both listings and detail pages? Are there required fields that aren’t present in the content you already have on the page? Do you just publish lots of invisible XHTML content to the page to fill in the missing stuff? Do you dare to deal with recursions? What if your data is split up in separate divs? What happens when your data model does not fit with the standard? What if you are not a very good HTML contortionist? What happens when your presentation is different based on varying pieces of available data? Is hCard publishing even useful at all on public pages without private, uniquely identifiable information? Oops, there are no microformat validators, either. This is especially difficult for something with as many ominous implications as a stepchild of iCalendar.

Much like an oral agreement, publishing microformats is an informal agreement between you and (hopefully) a developer community that sets up a relationship with plenty of vagueness, inertial resistance to change, and potential landmines to step on. Would you create a real developer API without a TOS, agreement, or at the very least, guidelines? Are you prepared to deal with objections if, when cutting costs, you rev a frontend design and lose some important aspect of microformat structure on the page (or, god forbid, you just don’t bring microformats over at all)? Alternatively, are you prepared to announce all frontend markup changes? Does publishing a microformat without a special agreement mean that you are implicitly allowing comprehensive scraping of your web data? If you spend an hour seriously considering the costs of treating your frontend interface as a programming API for the sake of your relationship with developers, would you then rather spend those costs there, or on proper versioning, documentation, and communication with a developer community over a real publishing protocol or API? Publishing microformats without formal consumer support is commonly what happens, but it is a poor midway point to leave yourself at.

The only place I can still imagine microformats surviving a cost/benefit analysis is in the case of preparing for search engine crawlers. In theory, if everyone publishes their websites according to a few semantic standards, the Big SE’s can embed structured data in their search results and act as aggregators of real data. There are a few practical issues that you’d have to ignore in order to go for this pipe dream, though. First, you’d have to hope that the structure of microformats is fault-tolerant enough to survive the endless mangling of random developers trying to publish junk data, and that whatever was clean enough to make it through would be parseable, high quality, not eliminated in dupe checks, and relevant to a big enough segment of searches to justify the cost for the search engines to link to it. Second, you’d have to accept the fact that you probably wouldn’t get any permanent special treatment in search results, lest microformats become a new meta tag for SEO. Third, you’d have to assume that the big SE’s wouldn’t take this handful of high-order data entities and send them to their big topical datastores for aggregation and republishing themselves. I realize the benefit of a “standard.” However, why wouldn’t big aggregators crawl the web for content that conforms to the existing, more mature standards that microformats are based on? Especially when that content has often been passed through working, real world consumers of vCard/iCalendar before making it to production.

Anyway, here’s the question I want to put into the reader’s mind: should one spend time and effort making a frontend into an informal API through microformats, or instead spend it on building a fully supported API or data publishing system that exists and operates separately? I think my stance is clear – I’m not against the theory of microformats, but I’m certainly going to differ with anyone who thinks it’s practical. If you can really think all that through, and still think microformats are a good thing to spend your resources on, then by all means, give it a shot. Just don’t say you weren’t warned.

Tech

Couldn’t find libtidy with utidylib on MacOSX

January 5th, 2009

Quick tip. Since the dev only built a Windows binary for version 0.2.1 of uTidyLib, this bugfix for locating the dynamic libtidy on Mac OS X is necessary.

Just edit tidy/lib.py, re-run the setup install script, and it suddenly can find libtidy.

Tech

Resolving a Green Color Band on the Woot Infocus DLP

December 9th, 2008

As some of my friends know, I’ve had a love/hate relationship with the Woot-sold Infocus DLP 61md10 HDTV. I had one bulb fail nearly 10 months in, and I lucked out and got a warranty replacement. After moving this past weekend, I had a weird green color band floating up the screen very slowly on both component inputs. I also had a very loud hum when connecting the audio leads directly to the TV to power the stereo, though at the time I didn’t realize that the two might be related.

After doing some research about color weirdness with DLP’s, as well as specific problems associated with the Infocus / RCA model that I have, I was worried that the issue might be the colorwheel on the lamp assembly, or (horror of horrors) possibly the entire “light engine.” I blew out the colorwheel with some compressed air, but didn’t notice any ball bearing noise at all, so I didn’t think that was the issue.

The next day, I was able to hook up an HDMI input and verify that the problem didn’t exist digitally (phew). This post referring to 60 Hz power line interference made me curious about the cable line coming in from Charter to the cable box. I routed that cable line through my cheesy Monster Cable surge protector (which has 3 coax in/out thingamajigs), and that cleared up the issue – no more weird green color band, and no more loud hum. I am assuming that there’s some sort of voltage regulator or noise filter on those coax passthrus, but this is the first time I’ve been glad that Woot gave away those surge protector kits along with the TV.

Now, hopefully I can enjoy another couple thousand hours of DLP television without spending $500 on a new lamp. :)

Tech

Dogg, It Must Feel Sick As Hell To License Patents From More Patent Trolls

November 24th, 2008

RPX, a new startup, has come onto the scene offering to aggregate a patent portfolio, and offer a covenant not to sue to companies who license their entire portfolio in aggregate. They are funded by venture capital and also claim to have IBM and Cisco as initial clients.

Dogg, that reminds me of some cards.

Much like Roast Beef’s greeting cards, a technology patent originator can fathomably come up with an infinite pool of them by just applying the same rubric to a new set of circumstances (in the patent trolls’ case, by appending 14 pages of a definition of a computer to tie a business process to a tangible machine — le voila!). This means, as a creator of software technology, you are potentially on the hook for limitless amounts of patent infringement, as nearly everything you can do on a computer has had a patent filed on it in the past 20 years. The mere fact that one entity is aggregating a portfolio of patents has no bearing on the fact that there are potentially unlimited amounts of patents that can be aggregated by others, much as readers of Achewood would probably be able to assist Roast Beef in coming up with limitless amounts of new greeting cards.

It looks like TechCrunch is, perhaps appropriately, skeptical. RPX’s business model revolves around subscription fees instead of enforcing patents through offensive litigation. If I’m understanding these terms correctly, if you don’t subscribe to their patent portfolio, then you are potentially subject to offensive litigation (which is the practice they are claiming to fight against). Correct me if I’m wrong, but doesn’t this seem to you to just be an extension of existing patent trolling practices (i.e. adding on subscription revenues to existing litigation + settlement proceeds) that is made feasible by the aggregation of large numbers of patents with a large amount of venture capital, instead of an attack on the NPE cottage industry? If so, wouldn’t it seem disingenuous to frame oneself as such?

It is not possible for one entity to aggregate all broad technology patents that pose a threat to oneself. Therefore, the subscription of a company to one particular NPE’s portfolio does not preclude the possibility of being sued for infringement by others. If a subscription model proves to be a more profitable / reliable revenue stream, you can expect other NPEs to adopt it too. Who knows, perhaps portions of the subscription model will be patented by RPX, and they’ll force other NPEs to license their patent as well if they want in on the game. It would be like patenting some aspects of the business process of patent trolling, which would be a fun irony, but still a sad one.

Some days, it seems like buying rights gets you a better return than buying property, but mainly if you’re an asshole.

Tech

My PHP crap, anyone interested?

November 21st, 2008

I’ve been working on my own PHP-based project for a month or so, and have rewritten some basic generic libraries from scratch in PHP. I’m pondering whether I should open source them, and would appreciate it if you would comment on this post if you’re interested in seeing me put in the effort to write docs, standardize, then toss specific portions or all of it online. I learned my lesson through Freetag; it takes a lot of energy to keep something maintained. Here’s some of what I have:

  • A database wrapper singleton around PDO
    • Simplifies code for working with “prepared statements” DBI-style. PDO prepared statements work great for getting around SQL injection, but I found them cumbersome by nature.
    • Consolidated error and exception handling with custom callback support.
    • I’ll eventually add fancier stuff in here, such as handle splitting for replicated architectures, and some read-after-write consistency code.
  • A simple nonce library for protecting against XSRF/CSRF
  • A simple library for handling file and image upload/resize
  • A simple code profiling utility
  • Some validation routines (I would like to add in some self-documenting API neatness that I wrote about before, when I get some time)
  • Some other random handy stuff, like arg signing, link building, human interval descriptions.
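To give a flavor of the nonce item in the list above, here is a minimal sketch of the general technique, not the library itself. make_nonce() and check_nonce() are hypothetical names, the one-hour lifetime is an assumption, and the token is an HMAC bound to a session and an action rather than a true single-use nonce.

```php
<?php
// Hypothetical helper: builds a time-limited token bound to one session
// and one action. $secret is a server-side key that never leaves the server.
function make_nonce($secret, $sessionId, $action, $now)
{
    $expires = $now + 3600;  // assumed one-hour lifetime
    $mac = hash_hmac('sha256', $sessionId . '|' . $action . '|' . $expires, $secret);
    return $expires . ':' . $mac;
}

// Hypothetical helper: returns true only when the token is well-formed,
// unexpired, and matches the given session and action.
function check_nonce($nonce, $secret, $sessionId, $action, $now)
{
    $parts = explode(':', $nonce, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return false;
    }
    $expires = (int) $parts[0];
    if ($expires < $now) {
        return false;  // token has expired
    }
    $expected = hash_hmac('sha256', $sessionId . '|' . $action . '|' . $expires, $secret);
    // hash_equals compares in constant time to avoid leaking the MAC.
    return hash_equals($expected, $parts[1]);
}

// Usage: embed the token in a hidden form field, then verify it on submit.
$token = make_nonce('s3cret', 'sess1', 'delete_post', 1000);
$ok = check_nonce($token, 's3cret', 'sess1', 'delete_post', 1500);
```

Binding the token to the action name means a token issued for one form cannot be replayed against another.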

It’s all pretty consistent with my philosophy of abstracting away website-related functionality instead of architecture. With this set of limited libraries, I’m building at a pretty fast rate, and I can almost always figure out what’s going on by just looking at one or two files. I’m trying to get a nice foundation that I can build fairly serious stuff on top of without getting completely confused.

Oh, and it all depends on php/filter being enabled with a default filter of “special_chars”. Come to think of it, I might write a separate post about using that, because it’s handy.
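As a quick illustration of what that filter does, the sketch below applies it by hand with filter_var(). With filter.default set to “special_chars” in php.ini, PHP applies the same filter to request data automatically; the sample string here is made up.

```php
<?php
// Apply the same sanitizing filter by hand that filter.default would
// apply automatically to incoming request data.
$raw = '<b onclick="x()">Tom & Jerry</b>';
$safe = filter_var($raw, FILTER_SANITIZE_SPECIAL_CHARS);

// Markup characters come back HTML-encoded, so this prints inert text.
print $safe . "\n";
```

The upside of a default filter is that forgetting to escape something fails safe; the downside is that you have to explicitly ask for the raw value when you really need it.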

Tech