Semantic AudiovisuaL Entertainment
Reusable Objects


Language Processing and Speech Synthesis Terms

Artificial Neural Network (ANN)
A network of many simple processors that imitates a biological neural network. Neural networks have some ability to "learn" from experience and are used in applications such as speech recognition, robotics, medical diagnosis, signal processing, and weather forecasting.
Automatic Speech Recognition (ASR)
Process by which a computer converts a speech signal into a sequence of words.
Coarticulation
Modification of the pronunciation of a phoneme because of the surrounding phonetic context. This effect is related to the inertia of the vocal tract, which cannot move instantaneously from one state to another.
Continuous Speech
A continuous utterance without pauses between words.
Correlation
A statistical measurement of the interdependence or association between two quantitative or qualitative variables. A typical calculation multiplies a signal either by another signal (cross-correlation) or by a delayed version of itself (autocorrelation).
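As a sketch of the autocorrelation case, the Python snippet below multiplies a sampled sine wave by delayed copies of itself; the sampling rate, tone frequency, and function name are illustrative assumptions, not taken from this glossary. The correlation is largest at lag 0 (the signal's energy), large again after one full period, and negative after half a period:

```python
import numpy as np

# A short test signal: a 200 Hz sine wave sampled at 8 kHz for 20 ms.
fs = 8000
t = np.arange(0, 0.02, 1.0 / fs)
x = np.sin(2 * np.pi * 200 * t)

def autocorr(x, lag):
    # Multiply the signal by a delayed version of itself and sum.
    n = len(x)
    return float(np.dot(x[: n - lag], x[lag:]))

r0 = autocorr(x, 0)    # lag 0: the signal energy (maximum)
r40 = autocorr(x, 40)  # one full period of the 200 Hz tone: strong positive peak
r20 = autocorr(x, 20)  # half a period: strongly negative
```

A pitch estimator uses exactly this pattern: the lag of the first strong positive peak (here 40 samples) reveals the period of the signal.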
Dialog
A turn-taking exchange of audio, such as a human-to-human or human-to-computer exchange.
Digital Signal Processing (DSP)
DSP, or Digital Signal Processing, as the term suggests, is the processing of signals by digital means. A signal in this context can mean a number of different things. Historically the origins of signal processing are in electrical engineering, and a signal here means an electrical signal carried by a wire or telephone line, or perhaps by a radio wave. More generally, however, a signal is a stream of information representing anything from stock prices to data from a remote-sensing satellite. The term "digital" comes from "digit", meaning a number (you count with your fingers - your digits), so "digital" literally means numerical. A digital signal consists of a stream of numbers, usually (but not necessarily) in binary form. The processing of a digital signal is done by performing numerical calculations.
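As a minimal illustration of "processing a digital signal by performing numerical calculations", this Python sketch applies a 3-point moving average (a simple smoothing filter) to a short stream of numbers; the data values are made up:

```python
# DSP in its simplest form: numerical operations on a stream of numbers.
# A 3-point moving average smooths out rapid fluctuations such as noise spikes.
def moving_average(samples, width=3):
    out = []
    for i in range(len(samples) - width + 1):
        out.append(sum(samples[i:i + width]) / width)
    return out

noisy = [1.0, 1.2, 0.9, 1.1, 1.0, 5.0, 1.0]  # one spike at index 5
smooth = moving_average(noisy)
```

After filtering, the spike's influence is spread over neighbouring samples and its peak value is reduced, which is the essence of low-pass filtering expressed as arithmetic on numbers.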
Diphone
A sound consisting of two phonemes: one that leads into the sound and one that finishes the sound. For example, the word "hello" consists of these diphones: silence-h h-eh eh-l l-oe oe-silence.
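The "hello" decomposition can be sketched in Python; the phoneme symbols follow the glossary's own notation and the function name is illustrative:

```python
# Turn a phoneme sequence into diphones by padding with silence and
# pairing each phoneme with its successor.
def to_diphones(phonemes):
    padded = ["silence"] + phonemes + ["silence"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

hello = to_diphones(["h", "eh", "l", "oe"])
```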
Diphone concatenation
The text-to-speech engine concatenates short digital-audio segments and performs intersegment smoothing to produce a continuous sound.
Distance
A statistical measurement for comparing elements defined by variables or vectors, using scalar or vector subtraction of those elements. Examples: distance = a − b, |a − b|, or (a − b)^0.5; alternatively, two vectors may be treated as points and the straight-line distance measured between them.
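A sketch of the vector case in Python, measuring the straight-line (Euclidean) distance between two points treated as vectors; the sample points are illustrative:

```python
import math

# Euclidean distance: square-root of the summed squared component differences.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = euclidean([0.0, 0.0], [3.0, 4.0])  # the classic 3-4-5 right triangle
```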
Dynamic Time Warping (DTW)
The temporal-domain equivalent of instance-based learning with a complex distance function. In dynamic time warping, the problem is represented as finding the minimum distance between a set of template streams and the input stream. The class chosen is that of the "closest" template. However, rather than using the straightforward technique of comparing the value of the input stream at time t to the value of the template stream at time t, an algorithm is used that searches the space of mappings from the time sequence of the input stream to that of the template stream, so that the total distance is minimised. This is not always a linear path.
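A minimal Python sketch of this dynamic-programming formulation, assuming scalar-valued streams and an absolute-difference local distance (both assumptions, since real systems use feature vectors); at each step the alignment may advance the input, the template, or both, which is what permits a non-linear warping path:

```python
# cost[i][j] = minimum total distance aligning the first i input samples
# with the first j template samples.
def dtw_distance(input_seq, template):
    n, m = len(input_seq), len(template)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(input_seq[i - 1] - template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the template
                                 cost[i][j - 1],      # stretch the input
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# The same rise-and-fall shape spoken more slowly still matches the template:
slow = [1, 1, 2, 3, 3, 2, 1]
fast = [1, 2, 3, 2, 1]
```

Here the stretched version aligns with the template at zero total distance, even though a sample-by-sample comparison at equal times would report large differences.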
Electroglottograph (EGG)
Device for measuring throat movement caused by speaking.
Excitation
Stimulation of the vocal tract by vibratory action of the vocal cords or by a turbulent air flow. In a digital system, the vocal tract is typically modelled with a filter, and excitation of the filter is performed using time representations of pitch (voiced excitation) and noise (unvoiced excitation).
Hidden (Markov model) Tool Kit (HTK)
The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. http://htk.eng.cam.ac.uk/
HMM (Hidden Markov Models)
One of many statistical models commonly used in audio content analysis and speech recognition, for example for pattern recognition and for predicting following events (i.e., phonemes in speech) based on the sequence of prior events. HMMs are also particularly useful because they can be trained automatically.
Technically, a Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state an outcome or observation can be generated according to the associated probability distribution. It is only the outcome, not the state, that is visible to an external observer; the states are therefore "hidden" from the outside, hence the name Hidden Markov Model.
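A toy Python sketch of the forward algorithm for such a model, which computes the total probability of an observation sequence by summing over all hidden state paths; every probability value below is a made-up illustration number, not from this glossary:

```python
# A 2-state HMM with binary observations (0 or 1).
states = [0, 1]
start = [0.6, 0.4]                  # initial state probabilities
trans = [[0.7, 0.3], [0.4, 0.6]]    # transition probabilities
emit = [[0.9, 0.1], [0.2, 0.8]]     # P(observation | state)

def forward(observations):
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][observations[0]] for s in states]
    for obs in observations[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states]
    return sum(alpha)  # total probability of the observation sequence

p = forward([0, 0, 1])
```

In a recogniser, one such model is trained per phoneme or word, and the model assigning the highest probability to the observed acoustic features wins.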
International Phonetic Alphabet (IPA)
A standard system for indicating specific sounds, first introduced in 1886. The Unicode character set includes all single symbols and diacritics in the most recent revision of the IPA, which occurred in 1989, as well as a few IPA symbols no longer in use.
Language
A systematic means of communicating ideas or feelings by the use of conventionalized sounds, gestures, or marks having understood meanings.
Linear Predictive Coefficients (LPC)
Coefficients of the all-pole filters used to model frames of a speech signal.
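A sketch of one standard way such coefficients are estimated, the autocorrelation method with the Levinson-Durbin recursion (an assumption here, since the glossary does not name a method); numpy is assumed available:

```python
import numpy as np

def lpc(frame, order):
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation r[0..order] of the analysis frame.
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Levinson-Durbin recursion solving the Toeplitz normal equations.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a.copy()
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a

# A frame from a pure decaying resonance x[n] = 0.9 * x[n-1]:
frame = [0.9 ** n for n in range(50)]
coeffs = lpc(frame, 1)
```

For this synthetic frame the order-1 analysis recovers roughly [1, -0.9], i.e. the all-pole filter 1 / (1 - 0.9 z^-1) that generated it.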
Lombard effect
Changes in articulation due to environmental influence. For example, when a talker speaks in an environment with masking noise, it has been reported that the first formant of a vowel increases while the second formant decreases. The difficulty with the Lombard effect is a lack of understanding of how to quantify it.
Mel scale
The Mel scale approximates the sensitivity of the human ear.
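One common analytic form of the Mel scale can be sketched in Python; the constants 2595 and 700 are the widely used O'Shaughnessy fit, stated here as an assumption rather than sourced from this glossary:

```python
import math

# Equal steps in mels approximate equal perceived pitch steps; the scale is
# roughly linear below 1 kHz and logarithmic above it.
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

m1000 = hz_to_mel(1000.0)  # close to 1000 mels by construction
```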
Mel Frequency Cepstral Coefficients (MFCC)
MFCCs are the coefficients of the Mel cepstrum. The Mel cepstrum is the cepstrum computed on the Mel bands instead of the Fourier spectrum. Using the Mel scale takes better account of the mid-frequency part of the signal.
Multilingual Text-To-Speech (MTTS)
System that uses common algorithms for multiple languages. Thus, a collection of language-specific synthesizers does not qualify as a multilingual system. Ideally, all language-specific information should be stored in data tables, and all algorithms should be shared by all languages.
Natural language (NL)
A human language, as opposed to a command or programming language traditionally used to communicate with a computer.
Natural Language Processing (NLP)
Computer understanding, analysis, manipulation, and/or generation of natural language. This can refer to anything from fairly simple string-manipulation tasks like stemming, or building concordances of natural language texts, to higher-level AI-like tasks like processing user queries stated in a natural language. Natural language understanding is one of the hardest problems of Artificial Intelligence due to the complexity, irregularity and diversity of human language and the philosophical problems of meaning.
Phone
Actual sounds produced by the vocal tract while speaking, corresponding to an ideal phoneme definition.
Phoneme
The smallest structural unit of sound in any language that can be used to distinguish one word from another. A phoneme is often defined as an ideal sound unit associated with a set of articulatory gestures. The phonemes of a language therefore comprise a minimal theoretical set of units that are sufficient to convey all meaning in the language. Due to many different factors (accent, gender, coarticulatory effects, ...), a phoneme will have a variety of acoustic manifestations. The related, actual sounds that are produced in speaking are called phones.
Pitch
The measurable frequency or period at which the glottis vibrates.
Pitch Synchronous Over-Lap and Add (PSOLA)
Algorithm for independently modifying the fundamental frequency and duration of a speech signal. Used during concatenation of units selected from a finite speech database, so that minimal prosodic damage occurs due to mismatch between the target and the selected unit.
Prosody
A collection of phonological features, including pitch, duration, and stress, that define the rhythm of spoken language.
Speech
The communication or expression of thoughts in spoken words.
Speech Application Language Tags (SALT)
A markup language extension that integrates speech services into existing markup languages such as HTML and XHTML.
Speech Application Programming Interface (SAPI)
Microsoft Speech application programming interface. A set of routines, protocols, and tools that enable programmers to build speech-enabled applications for Microsoft Windows platforms.
Speech Synthesis Markup Language (SSML)
An XML-based markup language used to control characteristics of synthetic speech output, including voice, pitch, rate, volume, and pronunciation.
Text normalization
The process of converting abbreviations and non-word written symbols (for example, "7") into words that a speaker would say when reading that symbol out loud.
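A toy Python normalizer in this spirit; the tiny lookup tables are illustrative, and a real system would also handle dates, currency, ordinals, and ambiguous abbreviations ("Dr." as doctor or drive):

```python
# Map non-word written symbols to the words a speaker would say aloud.
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
ABBREV = {"Mr.": "mister", "St.": "street"}

def normalize(text):
    words = []
    for token in text.split():
        if token in ABBREV:
            words.append(ABBREV[token])
        elif token in DIGITS:
            words.append(DIGITS[token])
        else:
            words.append(token)
    return " ".join(words)

spoken = normalize("Mr. Smith has 7 cats")
```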
Text to speech (TTS)
The process of converting text into spoken language by breaking down the words of the text into small units, analyzing the input for occurrences that require text normalization, and generating the digital audio for playback. Used in voice-processing applications requiring production of broad, unrelated, and unpredictable vocabularies, such as products in a catalog or names and addresses. This technology is appropriate when system design constraints prevent the more efficient use of speech concatenation alone.
Text-to-speech control tags
Instructions that can be embedded in text sent to a text-to-speech engine to improve the quality of the spoken text.
Tones and Break Indices (ToBI)
ToBI is a framework for developing community-wide conventions for transcribing the intonation and prosodic structure of spoken utterances in a language variety. A ToBI framework system for a language variety is grounded in careful research on the intonation system and the relationship between intonation and the prosodic structures of the language (e.g., tonally marked phrases and any smaller prosodic constituents that are distinctively marked by other phonological means).
Note: ToBI is not an International Phonetic Alphabet for prosody. Because intonation and prosodic organization differ from language to language, and often from dialect to dialect within a language, there are many different ToBI systems, each one specific to a language variety and the community of researchers working on that language variety.
Transcription
A record of a speech-based conversation converted into written text.
Unvoiced sounds
Speech sounds produced by a turbulent flow of air created at some point of constriction in the vocal tract, and usually lacking pitch.
Utterance
Something that is spoken: a question, an answer, or a statement. An utterance is usually short, two or three sentences at most, perhaps a small part of a dialogue.
Vector Quantization (VQ)
Vector quantization is a data compression method where a set of data points is encoded by a reduced set of reference vectors, the codebook.
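A sketch of the encoding step in Python, assuming the codebook has already been trained (training the codebook itself is typically done with k-means or the LBG algorithm); the codebook and data points are made-up 2-D vectors:

```python
import math

# Replace each data point by the index of its nearest reference vector;
# only the small index needs to be stored or transmitted.
def nearest(point, codebook):
    return min(range(len(codebook)),
               key=lambda i: math.dist(point, codebook[i]))

codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
data = [(0.1, -0.2), (0.9, 1.2), (4.8, 5.1), (1.1, 0.8)]
indices = [nearest(p, codebook) for p in data]
```

Decoding simply looks each index up in the codebook, recovering an approximation of the original data, which is where the compression (and the loss) comes from.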
Voiced sounds
Speech sounds produced by vibratory action of the vocal cords and usually having pitch.