Media Resource Control Protocol (MRCP) is a new IETF technology which provides a standard internet interface to speech processing resources like synthesis (i.e., text-to-speech) and recognition. This interface involves an acronym soup of other IETF and W3C standards including SIP, SDP, RTP, SSML, SRGS, NLSML, and PLS. Although MRCP is stable and in use in the wild, to date there has been no tutorial or documentary material other than the IETF specifications.
This book will come as a double relief to anyone working in the area: first that such material has arrived, and second that it is of such high quality.
The book is in five parts, progressing from background material, through the mechanics of MRCP, to an example of MRCP and VoiceXML in action.
Part I gives background on MRCP and speech processing in general. The chapter on speech processing is best regarded as interesting supplementary (i.e., optional) reading, though a few references to more solid literature are given. The chapters on MRCP provide a worthwhile historical and architectural context.
Part II explains the nature of MRCP sessions. An MRCP session is a complex entity, involving the Session Initiation Protocol (SIP) and the Session Description Protocol (SDP) to set up two separate channels: a media session running over the Real-time Transport Protocol (RTP) to carry the audio data, and a control session running over TCP to carry control messages in MRCP message format. Consequently, this section carries a lot of responsibility.
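For readers new to this two-channel arrangement, here is a minimal sketch of what a session description might look like and how the two channels can be told apart. The SDP body is not taken from the book: it is an illustrative offer modelled on the style of the MRCPv2 specification, and every address, port and identifier in it is a made-up example value. The helper `parse_media_sections` is likewise hypothetical.

```python
# Illustrative SDP offer in the style of the MRCPv2 specification.
# All addresses, ports and identifiers are made-up example values.
SDP_OFFER = """\
v=0
o=client 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=application 9 TCP/MRCPv2 1
a=setup:active
a=connection:new
a=resource:speechsynth
a=cmid:1
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=recvonly
a=mid:1
"""


def parse_media_sections(sdp):
    """Split an SDP body into its media sections (one per 'm=' line)."""
    sections = []
    for line in sdp.splitlines():
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt>
            media, port, proto, fmt = line[2:].split(maxsplit=3)
            sections.append({"media": media, "port": int(port),
                             "proto": proto, "fmt": fmt, "attrs": {}})
        elif line.startswith("a=") and sections:
            # Attributes after an m= line belong to that media section;
            # flag attributes such as 'recvonly' get an empty value.
            key, _, value = line[2:].partition(":")
            sections[-1]["attrs"][key] = value
    return sections


sections = parse_media_sections(SDP_OFFER)
control = next(s for s in sections if s["proto"] == "TCP/MRCPv2")
audio = next(s for s in sections if s["proto"] == "RTP/AVP")

print("control channel resource:", control["attrs"]["resource"])
print("audio port:", audio["port"])
print("channels linked:", control["attrs"]["cmid"] == audio["attrs"]["mid"])
```

In this sketch the `cmid` attribute on the control section matches the `mid` attribute on the audio section, which is how the control channel is associated with the media stream it governs.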
Part III covers the XML formats used in the bodies of the MRCP control messages, namely Speech Synthesis Markup Language (SSML), Speech Recognition Grammar Specification (SRGS), Natural Language Semantics Markup Language (NLSML), and Pronunciation Lexicon Specification (PLS). As you can imagine, this section is tedious but necessary.
Part IV describes the resources that an MRCP server can provide: synthesiser, recogniser, recorder, and (speaker) verifier.
The final Part V introduces VoiceXML, describes how VoiceXML and MRCP interact and demonstrates this interaction with a small application example.
Three so-called appendices give overviews of the deprecated MRCP version 1, HTTP and XML.
The writing is clear and direct, and the coverage is comprehensive and thorough. Although the book is not explicitly split into 'tutorial' and 'reference' sections, it serves both purposes admirably. It is authoritative and dependable.
Virtually every topic covered has an explicit example: from just a snippet of XML or a SIP message header, to fully annotated walkthroughs of SIP or MRCP client/server sessions. These examples are never overlong and are always to the point.
There are a couple of minor errors in chapter 2 on the basic principles of speech processing:
- pp. 12 and 31: it is not possible to produce "sounds that are simultaneously voiced and unvoiced (e.g., the sound associated with 's' in 'is', pronounced like a 'z')". The example given is a voiced fricative: the noise (i.e., the aperiodic part of the acoustic signal) is caused not by devoicing, but by constriction in the vocal tract.
- p14 Figure 2.3: the items labelled allophones (possible phonetic variations within a phoneme) are actually triphones (ordered sets of three phonemes).
These errors are very minor and do not obstruct understanding. As chapter 2 is probably the only dispensable part of the book, I imagine few readers will even come across them.
The appendices are 'so-called' because they are not actually appended to the book: they are available only as PDFs from the author's website. With the 'appendices' on HTTP and XML it's no great loss, but Appendix A on differences between MRCP versions 1 and 2 is important and should have been included in the real book.
My only real disappointment is that the code from the walkthrough examples is not available for download. These walkthroughs are effectively verbatim transcriptions of MRCP sessions and as such would be extremely useful to people developing or testing related software.
MRCP is a very new technology and this book is its only substantial documentation, barring the IETF's RFCs. Apart from a couple of minor quibbles, this book is a model of what a first book on a new technology should be. I recommend it highly to anyone working on speech processing over IP, or indeed to anyone thinking of writing a book on a new technology.