We enclose spoken utterances in double quotation marks to distinguish them from sentences, which we print in italics. However, a spoken utterance consists of more than words. In speech, meanings are communicated not merely by what is said but also by the way it is said. Read these four brief dialogues.

A: Has the Winston Street bus come yet?

B: Sorry, I didn’t understand. What did you say?

C: I’m afraid Fred didn’t like the remark I made.

D: Oh? What did you say?

E: Some of my partners say they wouldn’t accept these terms.

F: And you? What did you say?

G: You’re misquoting me. I didn’t say anything like that.

H: Oh? What did you say?

The sequence of words “What did you say?” occurs in all four dialogues, but it is pronounced differently in each. Individual speakers may vary somewhat in just the way they pronounce it, but the four renditions can be represented as follows, where the most prominent syllable is indicated with capital letters and the rising or falling of the voice is indicated by letters going up or down.

  1. WHAT did you say?
  2. What did you SAY?
  3. What did YOU say?
  4. What DID you say?

We produce all our spoken utterances with a melody, or intonation: by changing the speed with which the vocal bands in the throat vibrate, we produce rising or falling pitch or combinations of rise and fall. By making one syllable in a sense-group especially loud and long, usually where the change of pitch occurs, we endow that word with a special prominence called accent. Intonation and accent together constitute prosody, the meaningful elements of speech apart from the words that are uttered.

Within each sense-group one word (more accurately, the stressed syllable of one word) is more prominent than the rest of the group, giving special attention or focus to that word. Thus, the more numerous the divisions made, the more points of emphasis there are. Compare “I’d never say THAT” with one focus and “I/would NEVer/say THAT” with three.

When speech is represented in print, italics are sometimes used to indicate the accent, but this is done only sporadically and unevenly; our writing system largely neglects this important element of spoken communication. A written transcript of a speech can be highly misleading because it is only a partial rendition of that speech. In speech, there is always an accent in some part of an utterance, and placement of accent in different parts of an utterance creates differences of meaning.

In the English language, accent is mobile, enabling us to communicate different meanings by putting the emphasis in different places. The usual place is on the last important word, for instance:

My cousin is an ARchitect.

If the utterance is broken into two or more sense-groups, each group has its own accent. The last accent is ordinarily the most prominent of all because the pitch changes on that syllable.

            My COUsin is an ARchitect.

            My cousin EDWard, who lives in FULton, is an ARchitect.

Thus the speaker can highlight one word or several words in an utterance and give special focus to that word or those words.

The placement of accent on different words ties the utterance to what has been said previously. For example, in reply to the question “What does your cousin do?” one might say:

            My cousin’s an ARchitect.

            Edward’s an ARchitect.


Here the word architect is new information and its stressed syllable is accented. Edward, or my cousin, is old, or given, information: a reference to what was already in the discourse. Suppose, instead, that nothing had been said about anybody’s cousin but the discussion had somehow turned to architects. One might then volunteer this information:

            My cousin EDWard’s an architect.

Here my cousin Edward is new information and the stressed syllable of the name Edward is accented. The phrase an architect now represents given information and is de-accented.

Accent, by giving special focus to one word, can create contrast with other words that might have been used in the same place. Moving the accent to different words creates different meanings in what would otherwise be a single utterance.[1]

The Role of Prosody in Sentence Processing

Prosody is a general term for a variety of acoustic features (what we hear) that ordinarily accompany a spoken sentence. One prosodic feature is the intonation pattern of a sentence. Intonation refers to pitch changes over time, as when a speaker’s voice rises in pitch at the end of a question or drops at the end of a sentence. A second prosodic feature is word stress, which is, in fact, a complex subjective variable based on loudness, pitch, and timing. Two final prosodic features are the pauses that sometimes occur at the ends of sentences or major clauses and the lengthening of final vowels in words immediately prior to a clause boundary (Cooper & Sorensen, 1981; Ferreira, 1993; Streeter, 1978).

Prosody plays numerous important roles in language processing. Prosody can indicate the mood of a speaker (happy, angry, sad, sarcastic), it can mark the semantic focus of a sentence (Jackendoff, 1972), and it can be used to disambiguate the meaning of an otherwise ambiguous sentence, such as I saw a man with a telescope (Beach, 1991; Ferreira, Henderson, Anes, Weeks, & McFarlane, 1996; Wales & Toner, 1979).

A more subtle effect of prosody is the way it can be used to mark major clauses of a sentence. Consider the sentence, In order to do well, he studied very hard. If you say this sentence aloud, you will notice how clearly the clause boundary (indicated here by the comma) is marked by intonation, stress, and timing. Note especially how the speaker automatically lengthens the final vowel in the word just prior to the clause boundary (in this case, the word well).

Although Garrett and his colleagues used an ingenious splicing technique to eliminate prosodic cues, this had the effect of underestimating their importance when such cues were present. When studies analogous to the click studies are conducted, but with the formal clause boundary and the prosodic marking for a clause boundary placed in direct conflict, clicks just as often migrate to the point marked by prosody as to the formal syntactic boundary (Wingfield & Klein, 1971).

Probably the experiment that cast the most dramatic doubt on whether the click studies were tapping on-line perceptual segmentation rather than reflecting a post-perceptual response bias was a study conducted by Reber and Anderson (1970). They found results parallel to the original click studies even when subjects were falsely told that the sentence they would hear contained “subliminal” clicks and asked to say where they thought these clicks had occurred. Although no clicks were actually presented, subjects more often reported having heard them at clause boundaries than within clauses.

It is certainly the case that clauses are important to the way people remember speech. In one series of experiments, subjects heard a tape-recorded passage that was stopped without warning at various spots in the passage. The moment the tape stopped, subjects were asked to recall as large a segment as possible of what had just been heard. Generally, subjects’ recall was bounded by full clauses, just as one would expect if major linguistic clauses do have structural integrity (Jarvella, 1970, 1971). The importance of clause boundaries and other syntactic constituents can also be demonstrated by giving subjects tape-recorded passages and telling them to interrupt the tape whenever they want to immediately recall what they have just heard. In such cases, subjects reliably press the tape recorder pause button to give their recall at periodic intervals corresponding exactly with the ends of major clauses and other important syntactic boundaries (Wingfield & Butterworth, 1984).

We should not dismiss all elements of an autonomy principle out of hand. Indeed, we will later review evidence for some degree of autonomous processing in the form of activation of word meaning independent of the sentence context in which the word is embedded. Few writers today, however, espouse the early version of syntactic autonomy that implies that analysis at the semantic level must await completion of a full clause or sentence boundary in the speech stream.

We do not want to suggest that clauses are unimportant units in sentence processing. Rather, our question is whether both syntactic and semantic analyses occur together and continuously interact as we hear a sentence. [2]

How Prosody Improves Word Recognition

Prosody has traditionally been regarded as useless for word recognition, since acoustic-prosodic features are mostly supra-segmental and are only weakly dependent on phonetic models. The only prosodic feature that has been widely used in speech recognizers is the normalized energy. Various attempts have been made to incorporate duration into phonetic or word models, but only small improvements have been achieved when duration-dependent models are applied to large-scale continuous speech recognition. There are also studies that attempted to incorporate pitch into speech recognizers, either by conditioning cepstral observations on pitch for normalization purposes, or by including pitch as an auxiliary variable to create pitch-dependent acoustic models [1]. The improvement reported by these attempts is small, and there is no explicit prosodic knowledge built into these systems.
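As a rough illustration of the energy feature mentioned above, the following Python sketch computes a per-frame log energy and normalizes it against the loudest frame in the utterance. The function names, the frame length, and the hop size are assumptions made for this example, and subtracting the utterance peak is only one common normalization convention; it is not taken from any particular recognizer.

```python
import math


def frame_log_energies(samples, frame_len=400, hop=160):
    """Return the log energy of each frame of a mono signal.

    frame_len and hop are illustrative defaults (assumed values),
    roughly 25 ms and 10 ms at a 16 kHz sampling rate.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        # A small floor keeps log() defined for silent frames.
        energies.append(math.log(energy + 1e-10))
    return energies


def normalize_energies(log_energies):
    """Shift log energies so the loudest frame in the utterance is 0."""
    peak = max(log_energies)
    return [e - peak for e in log_energies]


if __name__ == "__main__":
    # Toy signal: one silent frame followed by one loud frame.
    signal = [0.0] * 400 + [1.0] * 400
    raw = frame_log_energies(signal, frame_len=400, hop=400)
    print(normalize_energies(raw))
```

Normalizing within the utterance makes the feature insensitive to overall recording level, which is one reason a relative energy measure is easier to use than raw energy.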

On the other hand, due to the dependence of prosody on high-level linguistic units such as disfluency, syntax, dialog act, topic, meaning and emotion, prosody has been successfully used to disambiguate syntactically distinct sentences with identical phoneme strings, infer the punctuation of a recognized text, segment speech into sentences and topics, recognize dialog act labels [2], and detect speech disfluencies. Can prosody ever help word recognition? Linguistic studies have confirmed that humans are able to understand content with lower cognitive load and higher accuracy while listening to natural prosody, as opposed to monotone or foreign prosody [3]. This suggests that it is possible to utilize prosody to improve automatic word recognition. [3]

Figure 1: A Bayesian network representing the complex relationship among the acoustic-phonetic features (X), acoustic-prosodic features (Y), word sequence (W), prosody sequence (P), syntax sequence (S) and meaning (M) of an utterance.[4]


