Semantic AudiovisuaL Entertainment Reusable Objects


SALERO Publications

KaleiVoiceKids: Interactive Real-Time Voice Transformation for Children
Paper (pdf)
O. Mayor, J. Bonada, J. Janer (UPF)

The 9th International Conference on Interaction Design and Children (IDC 2010) (June 2010, Barcelona, Spain)

In this paper we describe the adaptation of an existing real-time voice transformation exhibit to the special case of children as the interacting subjects. Many factors have been taken into consideration to adapt the body interaction design, the visual feedback given to the user and the core technology itself to the requirements of children. The paper includes a description of this installation, which is being used daily by hundreds of children in a permanent museum exhibition.

A Framework For Evaluating Automatic Image Annotation Algorithms
Paper (for purchase)
K. Athanasakos, V. Stathopoulos, J. Jose (UG)

32nd European Conference on Information Retrieval (ECIR 2010) (March 2010, Milton Keynes, United Kingdom)

Several Automatic Image Annotation (AIA) algorithms have been introduced recently, which have been found to outperform previous models. However, each one of them has been evaluated using either different descriptors, collections or parts of collections, or "easy" settings. This fact renders their results non-comparable, while we show that collection-specific properties, and not the actual models, are responsible for the high reported performance measures. In this paper we introduce a framework for the evaluation of image annotation models, which we use to evaluate two state-of-the-art AIA algorithms. Our findings reveal that a simple Support Vector Machine (SVM) approach using global MPEG-7 features outperforms state-of-the-art AIA models across several collection settings. These models appear to depend heavily on the set of features and the data used, and it is easy to exploit collection-specific properties, such as tag popularity (especially in the commonly used Corel 5K dataset), and still achieve good performance.

A multi faceted recommendation approach for explorative video retrieval tasks
Paper (for purchase)
D. Vallet, M. Halvey, D. Hannah, J. Jose (UG)

International Conference on Intelligent User Interfaces (IUI) (February 2010, Hong Kong, China)

In this paper we examine the use of multi-faceted recommendations to aid users while carrying out exploratory video retrieval tasks. These recommendations are integrated into ViGOR (Video Grouping, Organisation and Retrieval), a system which employs grouping techniques to facilitate video retrieval tasks. Two types of recommendations based on past usage history are utilised: the first couples the multi-faceted nature of exploratory video retrieval tasks with the current user interests to provide global recommendations, while the second exploits the organisational features of ViGOR to provide recommendations based on a specific aspect of the user's task.

TV News Video Story Segmentation based on Semantic Coherence and Content Similarity
Paper (for purchase)
H. Misra, F. Hopfgartner, A. Goyal, P. Puttu, J. Jose (UG)

16th International Conference on Multimedia Modelling (MMM 2010) (January 2010, Chongqing, China)

In this paper, we introduce and evaluate two novel approaches, one using the video stream and the other using the closed-caption text stream, for segmenting TV news into stories. The segmentation of the video stream into stories is achieved by detecting anchor person shots, and the text stream is segmented into stories using a Latent Dirichlet Allocation (LDA) based approach. The benefit of the proposed LDA based approach is that along with the story segmentation it also provides the topic distribution associated with each segment. We evaluated our techniques on the TRECVid 2003 benchmark database and found that though the individual systems give comparable results, a combination of the outputs of the two systems gives a significant improvement over the performance of the individual systems.
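
The combination step can be sketched as a simple agreement rule between the two boundary sets. This is only an illustrative fusion over toy shot indices; the paper does not publish its exact combination scheme, and the tolerance parameter here is an assumption:

```python
def fuse_boundaries(video_bounds, text_bounds, tol=2):
    """Keep a story boundary from the text stream when an anchor-shot
    boundary from the video stream lies within `tol` shots of it.
    (Hypothetical fusion rule for illustration only.)"""
    fused = []
    for tb in text_bounds:
        if any(abs(tb - vb) <= tol for vb in video_bounds):
            fused.append(tb)
    return fused
```

Requiring agreement between the two streams is one plausible way a combination could outperform either system alone: each stream vetoes the other's spurious boundaries.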

Semantic User Modelling for Personal News Video Retrieval
Paper (for purchase)
Frank Hopfgartner, J. Jose (UG)

16th International Conference on Multimedia Modelling (MMM 2010) (January 2010, Chongqing, China)

There is a need for personalised news video retrieval due to the explosion of news materials available through broadcast and other channels. In this work we introduce a semantic based user modelling technique to capture the users’ evolving information needs. Our approach exploits the Linked Open Data Cloud to capture and organise users’ interests. The organised interests are used to retrieve and recommend news stories to users. The system monitors user interaction with its interface and uses this information for capturing their evolving interests in the news. New relevant materials are fetched and presented to the user based on their interests. A user-centred evaluation was conducted and the results show the promise of our approach.

Semantic Based Adaptive Movie Summarisation
Paper (for purchase)
R. Ren, H. Misra, J. Jose (UG)

16th International Conference on Multimedia Modelling (MMM 2010) (January 2010, Chongqing, China)

This paper proposes a framework for automatic video summarization by exploiting internal and external textual descriptions. The web knowledge base Wikipedia is used as a middle media layer, which bridges the gap between general user descriptions and exact film subtitles. Latent Dirichlet Allocation (LDA) detects as well as matches the distribution of content topics in Wikipedia items and movie subtitles. A saliency based summarization system then selects perceptually attractive segments from each content topic for summary composition. The evaluation collection consists of six English movies, and a high topic coverage is shown over official trailers from the Internet Movie Database.

Towards Annotation of Video as Part of Search
Paper (for purchase)
M. Halvey, J. Jose (UG)

16th International Conference on Multimedia Modelling (MMM 2010) (January 2010, Chongqing, China)

Search for multimedia is hampered by the lack of both quality and quantity of annotations. In recent years there has been a growth in multimedia search services that emphasise interactivity between the user and the interface. Some of these systems present an as yet untapped resource for providing annotations for video. In this paper, we investigate the use of a new innovative grouping interface for video search to provide additional annotations for video collections. The annotations provided are an inherent part of the search interface, thus imposing less overhead on the user in providing annotations. In addition, we believe that users are more likely to provide high quality annotations because the annotations are used to aid their search. Specifically, we investigate the annotations provided as part of two evaluations of our system; the results of these evaluations also demonstrate the utility and benefit of a grouping interface for video search [8]. The results of the analysis presented in this paper demonstrate the benefit of this implicit approach for providing additional high quality annotations for video collections.

Feature Subspace Selection for Efficient Video Retrieval
Paper (for purchase)
A. Goyal, R. Ren, J. Jose (UG)

16th International Conference on Multimedia Modelling (MMM 2010) (January 2010, Chongqing, China)

The curse of dimensionality is a major issue in video indexing. Extremely high dimensional feature spaces seriously degrade the efficiency and the effectiveness of video retrieval. In this paper, we exploit the characteristics of document relevance and propose a statistical approach to learn an effective feature subspace from a multimedia document collection. This involves four steps: (1) density based feature term extraction, (2) factor analysis, (3) bi-clustering and (4) communality based component selection. Discrete feature terms are a set of feature clusters which smooth the feature distribution in order to enhance the discrimination power; factor analysis tries to depict the correlation between different feature dimensions in a loading matrix; bi-clustering groups both components and factors in the factor loading matrix and selects feature components from each bi-cluster according to the communality. We have conducted extensive comparative video retrieval experiments on the TRECVid 2006 collection. Significant performance improvements are shown over the baseline, PCA-based K-means clustering.
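
The final selection step (4) can be sketched as picking, from each bi-cluster, the component whose factor loadings explain the most variance. The loading matrix and bi-cluster groupings below are toy stand-ins, not data from the paper:

```python
def communality(loadings_row):
    # Communality of a component: the variance it shares with all
    # factors, i.e. the sum of its squared factor loadings.
    return sum(l * l for l in loadings_row)

def select_components(loading_matrix, biclusters):
    """From each bi-cluster (a list of component row indices), keep
    the component with the highest communality."""
    selected = []
    for group in biclusters:
        best = max(group, key=lambda i: communality(loading_matrix[i]))
        selected.append(best)
    return selected
```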

Statement-based Semantic Annotation of Media Resources
Paper (pdf)
W. Weiss, W. Halb (JRS), T. Bürger (LFUI), R. Villa, P. Swamy (UG)

4th International Conference on Semantic and Digital Media Technologies (SAMT 2009) (December 2009, Graz, Austria)

Currently the media production domain lacks efficient ways to organize and search for media assets. Ontology-based applications have been identified as a viable solution to this problem; however, they are sometimes too complex for non-experienced users. We present a fast and easy to use approach to create semantic annotations and relationships of media resources. The approach is implemented in the SALERO Intelligent Media Annotation & Search system. It combines the simplicity of free text tagging with the power of semantic technologies, and by that makes a compromise in the complexity of full semantic annotations. We present the implementation of the approach in the system and an evaluation of different user interface techniques for creating annotations.

CorpVis: An Online Emotional Speech Corpora Visualisation Interface
Paper (pdf)
C. Cullen, B. Vaughan, J. McAuley, E. McArthy (DIT)

4th International Conference on Semantic and Digital Media Technologies (SAMT 2009) (December 2009, Graz, Austria)

Our research in emotional speech analysis has led to the construction of several dedicated high quality, online corpora of natural emotional speech assets. The requirements for querying, retrieval and organization of assets based on both their metadata descriptors and their analysis data led to the construction of a suitable interface for data visualization and corpus management. The CorpVis interface is intended to assist collaborative work between several speech research groups working with us in this area, allowing online collaboration and distribution of assets to be performed. This paper details the current CorpVis interface into our corpora, and the work performed to achieve this.

Multimedia Ontology Life Cycle Management with the SALERO Semantic Workbench
Paper (pdf)
T. Bürger (LFUI)

Workshop on Semantic Multimedia Database Technologies (SeMuDaTe2009) at 4th International Conference on Semantic and Digital Media Technologies (SAMT 2009) (December 2009, Graz, Austria)

Ontologies are gaining increased importance in the area of multimedia retrieval and management, as they try to overcome the commonly known drawbacks of existing multimedia metadata standards for describing the semantics of multimedia content. In order to build and use ontologies, users have to receive appropriate support. This paper presents the SALERO Semantic Workbench, which offers a set of services to engineer and manage ontologies throughout their life cycle, i.e., from their (semi-)automatic creation through their storage and use in annotation and search.

A Simulated User Study of Image Browsing Using High-Level Classification
Paper (for purchase)
T. Leelanupab, Y. Feng, V. Stathopoulos, J. Jose (UG)

4th International Conference on Semantic and Digital Media Technologies (SAMT 2009) (December 2009, Graz, Austria)

In this paper, we present a study of adaptive image browsing, based on high-level classification. The underlying hypothesis is that the performance of a browsing model can be improved by integrating high-level semantic concepts. We introduce a multi-label classification model designed to alleviate a binary classification problem in image classification. The effectiveness of this approach is evaluated by using a simulated user evaluation methodology. The results show that the classification assists users to narrow down the search domain and to retrieve more relevant results with less browsing effort.

Shot boundary detection based on Eigen coefficients and small Eigen value
Paper (pdf)
P. Puttu, J. Jose (UG)

4th International Conference on Semantic and Digital Media Technologies (SAMT 2009) (December 2009, Graz, Austria)

Detection of shot boundaries in a video has been an active research area for quite a long time, until the TRECVID community almost declared it a solved problem. A problem is assumed to be solved when no significant improvement is being achieved over the state-of-the-art methodologies. However, certain aspects can still be researched and improved. For instance, finding appropriate parameters instead of empirical thresholds to detect the shot boundaries is very challenging and is still being researched. In this paper, we present a fast, adaptive and non-parametric approach for detecting shot boundaries. An appearance based model is used to compute the difference between two subsequent frames. These frame distances are then used to locate the shot boundaries. The proposed shot boundary detection algorithm uses an asymmetric region of support that automatically adapts to the shot boundaries. Experiments have been conducted to verify the effectiveness and applicability of the proposed method for adaptive shot segmentation.
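
The non-parametric thresholding idea can be sketched over a sequence of frame distances with a symmetric local window; the paper's actual region of support is asymmetric and adapts around candidate boundaries, so this is a simplified sketch with assumed window size and sensitivity:

```python
from statistics import mean, stdev

def detect_boundaries(dists, support=5, k=3.0):
    """Flag frame index i as a shot boundary when its distance to the
    previous frame stands out from the statistics of a local window
    of neighbouring distances (no fixed global threshold)."""
    bounds = []
    for i in range(len(dists)):
        lo, hi = max(0, i - support), min(len(dists), i + support + 1)
        window = [d for j, d in enumerate(dists[lo:hi], lo) if j != i]
        if len(window) < 2:
            continue
        if dists[i] > mean(window) + k * stdev(window):
            bounds.append(i)
    return bounds
```

Because the threshold is derived from each candidate's own neighbourhood, gradual motion raises the local statistics and suppresses false positives, which is the point of an adaptive region of support.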

Text segmentation via topic modeling: an analytical study
Paper (for purchase)
H. Misra (UG), F. Yvon, J. Jose (UG), O. Cappé

18th ACM Conference on Information and Knowledge Management (CIKM 2009) (November 2009, Hong Kong, China)

In this paper, the task of text segmentation is approached from a topic modeling perspective. We investigate the use of the latent Dirichlet allocation (LDA) topic model to segment a text into semantically coherent segments. A major benefit of the proposed approach is that along with the segment boundaries, it outputs the topic distribution associated with each segment. This information is of potential use in applications like segment retrieval and discourse analysis. The new approach outperforms a standard baseline method and yields significantly better performance than most of the available unsupervised methods on a benchmark dataset.
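
A minimal sketch of the boundary decision, assuming per-block topic distributions have already been inferred by an LDA model (the distributions and threshold below are illustrative, not the paper's inference procedure): a boundary is placed wherever adjacent blocks' topic mixtures diverge.

```python
import math

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def segment(topic_dists, threshold=0.5):
    """Return block indices that start a new segment: a boundary is
    placed between adjacent blocks whose LDA topic distributions have
    cosine similarity below `threshold`."""
    return [i + 1 for i in range(len(topic_dists) - 1)
            if cosine(topic_dists[i], topic_dists[i + 1]) < threshold]
```

A side effect matching the abstract's point: each resulting segment carries the topic distributions of its blocks, so the dominant topic per segment comes for free.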

A proactive personalised retrieval system
Paper (for purchase)
D. Elliott, J. Jose (UG)

18th ACM Conference on Information and Knowledge Management (CIKM 2009) (November 2009, Hong Kong, China)

We present a personalised retrieval system that captures explicit relevance feedback to build an evolving user profile with multiple aspects. The user profile is used to proactively retrieve results between search sessions to support multi-session search tasks. This approach to supporting users with their multi-session search tasks is evaluated in a between-subjects multiple time-series study with ten subjects performing two simulated work situation tasks over five sessions. System interaction data shows that subjects using the personalised retrieval system issue fewer queries and interact with fewer results than subjects using a baseline system. The interaction data also shows a trend of subjects interacting with the proactively retrieved results in the personalised retrieval system.

Vocate: Auditory Interfaces for the LOK8 Project
Paper (pdf)
J. McGee, C. Cullen (DIT)

9th Annual Conference on Information Technology and Telecommunication (October 2009, Dublin, Ireland)

The auditory modality has a number of unique advantages over other modalities, such as a fast neural processing rate and focus-independence. As part of the LOK8 project’s aim to develop location-based services, the Vocate module will be seeking to exploit these advantages to augment the overall usability of the LOK8 interface and also to deliver scalable content in scenarios where the user may be in transit or requires focus-independence. This paper discusses these advantages and outlines three possible approaches that the Vocate module may take within the LOK8 project: speech interfaces, auditory user interfaces, and sonification.

Concept, Content and the Convict
Paper (pdf)
M. Tuomola, T. Korpilahti, J. Pesonen, A. Singh (TAIK), R. Villa, P. Swamy, Y. Feng, J. Jose (UG)

ACM International Conference on Multimedia (October 2009, Beijing, China)

This paper describes the concepts behind and implementation of the multimedia art work Alan01 / AlanOnline, which wakes up Alan Turing, criminally convicted in 1952, as a piece of code within the art work - thus fulfilling Turing's own vision of preserving human consciousness in a computer. The work's context is described within the development of associative storytelling structures built up by interactive user feedback via an image and video retrieval system. The input to the retrieval system is generated by Alan01 / AlanOnline via their respective sketch interfaces, and the output of the retrieval system is fed back to Alan01 / AlanOnline for further processing and presentation to the user within the context of the overall artistic experience. This paper, in addition to presenting the productions and the image retrieval system, also presents the user reception of the installation and online productions, and some of the issues and observations made during the development of the systems.

Using facial expressions and peripheral physiological signals as implicit indicators of topical relevance
Paper (for purchase)
I. Arapakis, I. Konstas, J. Jose (UG)

ACM International Conference on Multimedia (October 2009, Beijing, China)

Multimedia search systems face a number of challenges, emanating mainly from the semantic gap problem. Implicit feedback is considered a useful technique in addressing many of the semantic-related issues. By analysing implicit feedback information, search systems can tailor the search criteria to address more effectively users' information needs. In this paper we examine whether we could employ affective feedback as an implicit source of evidence, through the aggregation of information from various sensory channels. These channels range from facial expressions to neuro-physiological signals and are regarded as indicative of the user's affective states. The end-goal is to model user affective responses and predict with reasonable accuracy the topical relevance of information items without the help of explicit judgements. For modelling relevance we extract a set of features from the acquired signals and apply different classification techniques, such as Support Vector Machines and K-Nearest Neighbours. The results of our evaluation suggest that the prediction of topical relevance, using the above approach, is feasible and, to a certain extent, implicit feedback models can benefit from incorporating such affective features.
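
The K-Nearest Neighbours classifier mentioned above can be sketched in a few lines; the feature vectors here are toy stand-ins for the facial-expression and physiological measurements, and the labels are the binary relevance judgements being predicted:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the relevance label of `query` by majority vote among
    its k nearest training examples (Euclidean distance).
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]
```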

Exploiting Social Tagging Profiles to Personalize Web Search
Paper (for purchase)
D. Vallet, I. Cantador, J. Jose (UG)

Flexible Query Answering Systems, Eighth International Conference (FQAS 2009) (October 2009, Roskilde, Denmark)

In this paper, we investigate the exploitation of user profiles defined in social tagging services to personalize Web search. One of the key challenges of a personalization framework is the elicitation of user profiles able to represent user interests. We propose a personalization approach that exploits the tagging information of users within a social tagging service as a way of obtaining their interests. We evaluate this approach in Delicious, a social Web bookmarking service, and apply our personalization approach to a Web search system. Our evaluation results indicate a clear improvement of our approach over related state of the art personalization approaches.
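
The personalisation idea above can be sketched as a re-ranking step: mix each result's original retrieval score with how well its tags match the user's social-tagging profile. The linear combination, weights, and data below are illustrative assumptions, not the paper's exact model:

```python
def personalize(results, user_tags, alpha=0.5):
    """Re-rank results by alpha * original score plus (1 - alpha) *
    overlap between the user's tag-interest profile and the result's
    tags. `results` maps doc id -> (score, tag set); `user_tags`
    maps tag -> interest weight."""
    def combined(item):
        _, (score, tags) = item
        profile_score = sum(user_tags.get(t, 0.0) for t in tags)
        return alpha * score + (1 - alpha) * profile_score
    return [doc for doc, _ in
            sorted(results.items(), key=combined, reverse=True)]
```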

University of Glasgow at ImageCLEF 2009 Robot Vision Task
Paper (pdf)
Y. Feng, M. Halvey, J. Jose (UG)

CLEF Workshop 2009 (September/October 2009, Corfu, Greece)

For the University of Glasgow submission to the ImageCLEF 2009 Robot Vision Task, a large set of interest points was extracted using an edge corner detector; these points were used to represent each image. The RANSAC method [1] was then applied to estimate the similarity between test and training images based on the number of matched pairs of points. The location of the robot was then annotated based on the training image which contains the highest number of matched point pairs with the test image. A set of decision rules with respect to the trajectory behaviour of the robot's motion was defined to refine the final results. An illumination filter was also applied for two of the runs in order to reduce illumination effects.
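
The RANSAC matching step can be sketched on toy 2-D points. For brevity this sketch estimates only a translation between matched point sets (the task would typically need a richer transform), counting inliers exactly as RANSAC does: hypothesise from a minimal sample, then score by agreement.

```python
import random

def ransac_translation(src, dst, iters=200, tol=1.0, seed=0):
    """Estimate a 2-D translation between matched point lists `src`
    and `dst` and count inliers. Each iteration hypothesises the
    translation from one random correspondence, then counts how many
    pairs agree within `tol`. Returns (best_translation, inliers)."""
    rng = random.Random(seed)
    best_t, best_inliers = (0, 0), 0
    for _ in range(iters):
        i = rng.randrange(len(src))  # minimal sample: one pair
        t = (dst[i][0] - src[i][0], dst[i][1] - src[i][1])
        inliers = sum(
            1 for (sx, sy), (dx, dy) in zip(src, dst)
            if abs(sx + t[0] - dx) <= tol and abs(sy + t[1] - dy) <= tol)
        if inliers > best_inliers:
            best_t, best_inliers = t, inliers
    return best_t, best_inliers
```

The inlier count plays the role the abstract describes: the training image yielding the most matched point pairs determines the robot's annotated location.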

A Case Study of Exploiting Data Mining Techniques for an Industrial Recommender System
Paper (pdf)
I. Cantador, D. Elliott, J. Jose (UG)

3rd ACM Conference on Recommender Systems (October 2009, New York, United States)

We describe a case study of the exploitation of Data Mining techniques for creating an industrial recommender system. The aim of this system is to recommend items of a fashion retail store chain in Spain, producing leaflets for loyal customers announcing new products that they are likely to want to purchase. Motivated by the fact of having little information about the customers, we propose to relate demographic attributes of the users with content attributes of the items. We hypothesise that the description of users and items in a common content-based feature space facilitates the identification of those products that should be recommended to a particular customer. We present a recommendation framework that builds Decision Trees for the available demographic attributes. Instead of using these trees for classification, we use them to extract those content-based item attributes that are most widespread among the purchases of users who share the demographic attribute values of the active user. We test our recommendation framework on a dataset with one year of purchase transaction history. Preliminary evaluations show that better item recommendations are obtained when using demographic attributes in a combined way rather than using them independently.
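
The extraction step at the heart of the framework — finding the content attributes most widespread among purchases of demographically similar users — can be sketched as a counting exercise. The demographic and content attribute names below are invented for illustration; the paper performs this at the leaves of learned Decision Trees rather than by direct matching:

```python
from collections import Counter

def recommend_attributes(purchases, customer, top_n=2):
    """Return the `top_n` content attributes most widespread among
    purchases by customers who share all of `customer`'s demographic
    attribute values. `purchases` is a list of
    (demographics dict, item content-attribute list) pairs."""
    matches = [attrs for demo, attrs in purchases
               if all(demo.get(k) == v for k, v in customer.items())]
    counts = Counter(a for attrs in matches for a in attrs)
    return [a for a, _ in counts.most_common(top_n)]
```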

University of Glasgow at ImageCLEFPhoto 2009
Paper (pdf)
G. Zuccon, T. Leelanupab, A. Goyal, M. Halvey, P. Swamy, J. Jose (UG)

CLEF Workshop 2009 (September/October 2009, Corfu, Greece)

In this paper we describe the approaches adopted to generate the runs submitted to ImageCLEFPhoto 2009, with the aim of promoting document diversity in the rankings. Four of our runs are text-based approaches that employ textual statistics extracted from the captions of images: MMR [1], a state-of-the-art method for result diversification; two approaches that combine relevance information and clustering techniques; and an instantiation of the Quantum Probability Ranking Principle. The fifth run exploits visual features of the provided images to re-rank the initial results by means of Factor Analysis. The results reveal that our methods based on only text captions consistently improve the performance of the respective baselines, while the approach that combines visual features with textual statistics shows lower levels of improvement.
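
MMR itself is compact enough to sketch: at each step pick the document that best trades off relevance to the query against similarity to documents already selected. The toy similarity functions below stand in for the caption-based statistics used in the runs:

```python
def mmr(candidates, query_sim, doc_sim, lam=0.5, k=3):
    """Maximal Marginal Relevance re-ranking. `query_sim(d)` gives
    relevance of document d to the query; `doc_sim(d1, d2)` gives
    similarity between two documents; `lam` trades relevance against
    novelty. Returns the k selected documents in pick order."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(d):
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim(d) - (1 - lam) * redundancy
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
    return selected
```

In the toy test below, 'a' and 'b' are near-duplicates: plain ranking would return both, while MMR skips 'b' in favour of the less relevant but novel 'c' — exactly the diversification effect the runs target.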

SALERO Intelligent Media Annotation & Search
Paper (pdf)
W. Weiss, W. Halb (JRS), T. Bürger (LFUI), R. Villa, P. Swamy (UG)

International Conference on Semantic Systems (I-Semantics 2009) (September 2009, Graz, Austria)

Currently the media production domain lacks efficient ways to organize and search for media assets. Ontology based applications have been identified as a viable solution to this problem, however, sometimes being too complex for non-experienced users. We present the SALERO Intelligent Media Annotation & Search system which provides an integrated view onto results retrieved from different search engines. Furthermore, it offers a powerful, yet user-friendly Web-based environment to organize and search for media assets.

GTM-URL Contribution to the INTERSPEECH 2009 Emotion Challenge
Paper (pdf)
S. Planet, I. Iriondo, J.C. Socoró, C. Monzo, J. Adell (URL)

10th Annual Conference of the International Speech Communication Association (INTERSPEECH 2009) (September 2009, Brighton, United Kingdom)

This paper describes our participation in the INTERSPEECH 2009 Emotion Challenge. Starting from our previous experience in the use of automatic classification for the validation of an expressive corpus, we have tackled the difficult task of emotion recognition from speech with real-life data. Our main contribution to this work is related to the classifier sub-challenge, for which we tested several classification strategies. On the whole, the results were slightly worse than or similar to the baseline, but we found some configurations that could be considered in future implementations.

Vocate: Auditory Interfaces for Location-based Services
Paper (pdf)
J. McGee, C. Cullen (DIT)

Physicality Workshop (pdf) at 23rd Conference on Computer Human Interaction (HCI 2009) (September 2009, Cambridge, United Kingdom)

This paper discusses work being carried out by the Vocate module of the LOK8 project. The LOK8 project seeks to develop location-based services within intelligent social environments, such as museums, art galleries, office buildings, and so on. It seeks to do this using a wide range of media and devices employing multiple modalities. The Vocate module is responsible for the auditory aspect of the LOK8 environment and will seek to exploit the natural strengths afforded by the auditory modality to make the LOK8 system user-friendly in multiple scenarios, including instances where the user needs to be hands-free or eyes-free, or when screen size on a mobile device might be an issue. We look at what kinds of services the Vocate module will be seeking to implement within the LOK8 environment and discuss the strengths and weaknesses of three possible approaches - sonification, auditory user interfaces, and speech interfaces.

SALERO: Semantic AudiovisuaL Entertainment Reusable Objects
Paper (pdf)
G. Thallinger, G. Kienast (JRS), O. Mayor (UPF), C. Cullen (DIT), R. Hackett (BLITZ), J. Jose (UG)

International Broadcasting Conference (IBC 2009) (September 2009, Amsterdam, The Netherlands)

Broadcasters around the world are in desperate need to automate content production as much as possible. This need is twofold: on the one side, automatic production of well-structured programme parts is needed; on the other, production for different target devices is an issue. Over the past years, the EC project SALERO has developed a range of tools enabling automatic template-based production of animation clips with virtual presenters.

In this paper we describe the workflow devised by the project to automate major parts of media production based on 3D content. This is accompanied by a description of the individual tools developed and examples from the experimental productions implemented with these tools.

Simulated Evaluation of Faceted Browsing based on Feature Selection
Paper (for purchase)
F. Hopfgartner, T. Urruty, P. Bermejo, R. Villa (UG)

Multimedia Tools and Applications (Springer Netherlands, August 2009)

In this paper we explore the limitations of facet based browsing which uses sub-needs of an information need for querying and organising the search process in video retrieval. The underlying assumption of this approach is that the search effectiveness will be enhanced if such an approach is employed for interactive video retrieval using textual and visual features. We explore the performance bounds of a faceted system by carrying out a simulated user evaluation on TRECVid data sets, and also on the logs of a prior user experiment with the system. We first present a methodology to reduce the dimensionality of features by selecting the most important ones. Then, we discuss the simulated evaluation strategies employed in our evaluation and the effect on the use of both textual and visual features. Facets created by users are simulated by clustering video shots using textual and visual features. The experimental results of our study demonstrate that the faceted browser can potentially improve the search effectiveness.

Intelligent Media Annotation & Search
Poster (pdf)
W. Weiss, G. Thallinger (JRS)

Reasoning Web 2009 Summer School (August 2009, Brixen, Italy)

Presentation of the Semantic Annotation Tool for intelligent media annotation and search.

Supporting Aspect-Based Video Browsing -- Analysis of a User Study
Paper (for purchase)
T. Urruty, F. Hopfgartner, D. Hannah, D. Elliott, J. Jose (UG)

ACM International Conference on Image and Video Retrieval (CIVR 2009) (July 2009, Santorini, Greece)

In this paper, we present a novel video search interface based on the concept of aspect browsing. The proposed strategy is to assist the user in exploratory video search by actively suggesting new query terms and video shots. Our approach has the potential to narrow the "semantic gap" issue by allowing users to explore the data collection. First, we describe a clustering technique to identify potential aspects of a search. Then, we use the results to propose suggestions to the user to help them in their search task. Finally, we analyse this approach by exploiting the log files and the feedback from a user study.

An aspectual interface for supporting complex search tasks
Paper (for purchase)
R. Villa, J. Jose, H. Joho, I. Cantador (UG)

32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009) (July 2009, Boston, United States)

With the increasing importance of search systems on the web, there is a continuing push to design interfaces which are a better match with the kinds of real-world tasks in which users are engaged. In this paper, we consider how broad, complex search tasks may be supported via the search interface. In particular, we consider search tasks which may be composed of multiple aspects, or multiple related subtasks. For example, in decision making tasks the user may investigate multiple possible solutions before settling on a single, final solution, while other tasks, such as report writing, may involve searching on multiple interrelated topics. A search interface is presented which is designed to support such broad search tasks, allowing a user to create search aspects, each of which models an independent subtask of some larger task. The interface is built on the intuition that users should be able to structure their searching environment when engaged on complex search tasks, where the act of structuring and organization may aid the user in understanding his or her task. A user study was carried out which compared our aspectual interface to a standard web-search interface. The results suggest that an aspectual interface can aid users when engaged in broad search tasks where the search aspects must be identified during searching; for a task where search aspects were pre-defined, no advantage over the baseline was found. Results for a decision making task were less clear cut, but show some evidence for improved task performance.

On social networks and collaborative recommendation
Paper (for purchase)
I. Konstas, V. Stathopoulos, J. Jose (UG)

32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009) (July 2009, Boston, United States)

Social network systems, like last.fm, play a significant role in Web 2.0, containing large amounts of multimedia-enriched data that are enhanced both by explicit user-provided annotations and implicit aggregated feedback describing the personal preferences of each user. It is also a common tendency for these systems to encourage the creation of virtual networks among their users by allowing them to establish bonds of friendship and thus provide a novel and direct medium for the exchange of data. We investigate the role of these additional relationships in developing a track recommendation system. Taking into account both the social annotations and friendships inherent in the social graph established among users, items and tags, we created a collaborative recommendation system that effectively adapts to the personal information needs of each user. We adopt the generic framework of Random Walk with Restarts in order to provide a more natural and efficient way to represent social networks. In this work we collected a sufficiently representative portion of the music social network last.fm, capturing explicitly expressed bonds of friendship among users as well as social tags. We performed a series of comparison experiments between the Random Walk with Restarts model and a user-based collaborative filtering method using the Pearson Correlation similarity. The results show that the graph model system benefits from the additional information embedded in social knowledge, and that the graph model outperforms the standard collaborative filtering method.
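
Random Walk with Restarts is simple to sketch by power iteration on an adjacency structure: at each step the walker follows a random edge with probability 1 - alpha and jumps back to the query node with probability alpha, and the stationary probabilities rank candidate tracks. The tiny user/track graph below is a toy stand-in for the last.fm data:

```python
def rwr(adj, restart_node, alpha=0.15, iters=100):
    """Random Walk with Restarts over an undirected graph given as
    `adj`: node -> list of neighbours. Returns the stationary
    probability of each node with restarts at `restart_node`."""
    nodes = list(adj)
    p = {n: float(n == restart_node) for n in nodes}
    for _ in range(iters):
        nxt = {n: alpha * (n == restart_node) for n in nodes}
        for n in nodes:
            share = (1 - alpha) * p[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        p = nxt
    return p
```

In the test graph, track 'a' is directly connected to the query user 'u1' while track 'b' is only reachable through a friend 'u2', so 'a' ends up with the higher score — the friendship edge still gives 'b' nonzero probability, which is how the social links contribute.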

Topic prerogative feature selection using multiple query examples for automatic video retrieval
Paper (for purchase)
P. Swamy, J. Jose, A. Goyal (UG)

32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009) (July 2009, Boston, United States)

The wide acceptance of relevance feedback and collaborative systems has enabled users to express their preferences in terms of multiple query examples. The technology devised to utilize these preferences is expected to mine the semantic knowledge embedded within the query examples. In this paper, we propose a video mining framework based on dynamic learning from queries, using a statistical model for topic prerogative feature selection. The proposed method is specifically designed for multiple-query-example scenarios. The effectiveness of the framework has been established through extensive experimentation on the TRECVid 2007 data collection. The results reveal that our approach achieves performance on par with the best results for this corpus, without requiring any textual data.

Emotional Speech Corpus Creation, Structure, Distribution and Re-Use
Paper (pdf)
B. Vaughan, C. Cullen (DIT)

1st Young Researchers Workshop in Speech Technology (YRWST 2009) (April 2009, Dublin, Ireland)

This paper details the on-going creation of a natural emotional speech corpus, its structure, distribution, and re-use. Using Mood Induction Procedures (MIPs), high quality emotional speech assets are obtained, analysed, tagged (for acoustic features), annotated and uploaded to an online speech corpus. This method structures the corpus in a logical and coherent manner, allowing it to be used for more than one purpose, ensuring distribution via a URL and ease of access through a web browser. This is vital to ensuring the reusability of the corpus by third parties and third-party applications.

Bayesian Mixture Hierarchies for Automatic Image Annotation
Paper (for purchase)
V. Stathopoulos, J. Jose (UG)

31st European Conference on Information Retrieval (ECIR 2009) (April 2009, Toulouse, France)

Previous research on automatic image annotation has shown that accurate estimates of the class conditional densities in generative models have a positive effect in annotation performance. We focus on the problem of density estimation in the context of automatic image annotation and propose a novel Bayesian hierarchical method for estimating mixture models of Gaussian components. The proposed methodology is examined in a well-known benchmark image collection and the results demonstrate its competitiveness with the state of the art.

Split and Merge Based Story Segmentation in News Videos
Paper (for purchase)
A. Goyal, P. Punitha, F. Hopfgartner, J. Jose (UG)

31st European Conference on Information Retrieval (ECIR 2009) (April 2009, Toulouse, France)

Segmenting videos into smaller, semantically related segments that ease access to the video data is a challenging open research problem. In this paper, we present a scheme for semantic story segmentation based on anchor person detection. The proposed model makes use of a split and merge mechanism to find story boundaries. The approach is based on visual features and text transcripts. The performance of the system was evaluated using TRECVid 2003 CNN and ABC videos. The results show that the system is on par with state-of-the-art classifier-based systems.

Facet-based Browsing in Video Retrieval: A Simulation-based Evaluation
Paper (for purchase)
F. Hopfgartner, T. Urruty, R. Villa, J. Jose (UG)

15th International Conference on Multimedia Modelling (February 2009, Sophia Antipolis, France)

In this paper we introduce a novel interactive video retrieval approach which uses the sub-needs of an information need for querying and organising the search process. The underlying assumption is that search effectiveness is enhanced when this approach is employed for interactive video retrieval. We explore the performance bounds of a faceted system by applying a simulated-user evaluation methodology to TRECVID data sets and to the logs of a prior user experiment with the system. We discuss the simulation strategies employed in our evaluation and the effect of using both textual and visual features. Facets are simulated by clustering the video shots using textual and visual features. The experimental results of our study demonstrate that the faceted browser can potentially improve search effectiveness.

Comparison of Feature Construction Methods for Video Relevance Prediction
Paper (for purchase)
P. Bermejo, H. Joho, R. Villa, J. Jose (UG)

15th International Conference on Multimedia Modelling (February 2009, Sophia Antipolis, France)

Low level features of multimedia content often have limited power to discriminate a document's relevance to a query. This motivated researchers to investigate other types of features. In this paper, we investigated four groups of features: low-level object features, behavioural features, vocabulary features, and window-based vocabulary features, to predict the relevance of shots in video retrieval. Search logs from two user studies formed the basis of our evaluation. The experimental results show that the window-based vocabulary features performed best. The behavioural features also showed a promising result, which is useful when the vocabulary features are not available. We also discuss the performance of classifiers.

The Maskle: Automatic Weighting for Facial Animation - An Automated Approach to the Problem of Facial Weighting for Animation
A. Evans, M. Romeo, M. Dematei, J. Blat (FBM-UPF)

International Conference on Computer Graphics Theory and Applications (February 2009, Lisbon, Portugal)

Facial animation of 3D characters is frequently a time-consuming and repetitive process that involves either skeleton-rigging or pose-setting for morph targets. A major issue of concern is the necessity to repeat similar tasks for different models, re-creating the same animation system for several faces. Thus there is a need for reusable methods and tools that allow the introduction of automation into these processes. In this paper we present such a method to assist in the process of facial rigging: the Maskle. Based upon the standard bone-weight linear skinning animation technique, the desired distribution of vertex-movement weights for facial animation is pre-programmed into a low-resolution, generic facial mask. This mask, or ‘Maskle’, is then semi-automatically overlaid onto a newly created face model, before the animation-weight distribution is automatically transferred from the Maskle to the model. The result is a weight-painted model, created semi-automatically, and available for the artist to use for animation. We present results comparing Maskle-weighted faces to those weighted manually by an artist, which were treated as the gold standard. The results show that the Maskle is capable of automatically weight-painting a face to within 1.58% of a manually weighted face, with a maximum error of 3.82%. Comparison with standard professional automatic weighting algorithms shows that the Maskle is over three times more accurate.

Voice Processing and Synthesis by Performance Sampling and Spectral Models
Dissertation (pdf)
J. Bonada (UPF)

Dissertation at the Pompeu Fabra University (2008, Barcelona, Spain)

The singing voice is one of the most challenging musical instruments to model and imitate. Over several decades, much research has been carried out to understand the mechanisms involved in singing voice production. In addition, from the very beginning of sound synthesis techniques, singing has been one of the main targets for imitation and synthesis, and a large number of synthesizers have been created with that aim.

The goal of this thesis is to build a singing voice synthesizer capable of reproducing the voice of a given singer, both in terms of expression and timbre, sounding natural and realistic, and whose inputs would be just the score and the lyrics of a song. This is a very difficult goal, and in this dissertation we discuss the key aspects of our proposed approach and identify the open issues that still need to be tackled.

This dissertation substantially contributes to the field of singing voice synthesis: a) it critically discusses spectral processing techniques in the context of singing voice modeling, and provides significant improvements to the current state of the art; b) it applies the proposed techniques to other application contexts such as real-time voice transformations, museum installations or video games; c) it develops the concept of synthesis based on performance sampling as a way to model the sonic space produced by a performer with an instrument, focusing on the specific case of the singing voice; d) it proposes and implements a complete framework for singing voice synthesis; e) it explores the sonic space of the singing voice and proposes a procedure to model it; f) it discusses the issues involved in the creation of the synthesizer’s database and provides tools to automate its generation; g) it performs a qualitative evaluation of the synthesis results, comparing them to the state of the art and to real singer performances; h) it implements all the research results in an optimized software application for singing voice analysis, modeling, transformation and synthesis, including tools for database creation; i) a significant part of this research has been incorporated into commercial singing voice software by Yamaha Corp.

Automatic refinement of an expressive speech corpus assembling subjective perception and automatic classification
Paper (available for purchase)
I. Iriondo, S. Planet, J. Socoró, E. Martínez, F. Alías, C. Monzo (URL)

SPECOM - Speech Communication (December 2008)

This paper presents an automatic system able to enhance expressiveness in speech corpora recorded from acted or stimulated speech. The system is trained with the results of a subjective evaluation carried out on a reduced set of the original corpus. Once the system has been trained, it is able to check the complete corpus and perform an automatic pruning of unclear utterances, i.e. those whose expressive style differs from the one intended for the corpus. The content which most closely matches the subjective classification remains in the resulting corpus. An expressive speech corpus in Spanish, designed and recorded for speech synthesis purposes, has been used to test the presented proposal. The automatic refinement has been applied to the whole corpus and the result has been validated with a second subjective test.

The Impact of 3D On the Future of Gaming
A. Oliver (BLITZ)

3D Entertainment Summit (December 2008, Los Angeles, USA)

Blitz Games Studios demonstrated a world first with a live demonstration of a true stereoscopic high quality interactive game running on current generation videogames consoles. Previously this had not been considered possible. SALERO tools supported the rapid development of the demonstrator game content.

The SALERO Virtual Character Ontology
Paper (pdf)
T. Bürger (LFUI), P. Hofmair, G. Kienast (JRS)

Workshop on Semantic 3D Media at SAMT 2008 - Third International Conference on Semantic and Digital Media Technologies (December 2008, Koblenz, Germany)

The SALERO project observed a lack of ontologies for the description and annotation of characters in media production. In this field, ontologies could be used to support media asset management, information retrieval, automated production and reuse.

This paper presents the SALERO Virtual Character Ontology which can be used to describe and annotate characters in media production and game design to support aforementioned scenarios.

Emotional Speech Corpora for Analysis and Media Production
Paper (pdf)
C. Cullen, B. Vaughan, S. Kousidis, J. McAuley (DIT)

SAMT 2008 - Third International Conference on Semantic and Digital Media Technologies (December 2008, Koblenz, Germany)

Research into the acoustic correlates of emotional speech as part of the SALERO project has led to the construction of high quality emotional speech corpora, which contain both IMDI metadata and acoustic analysis data for each asset. Research into semi-automated, re-usable character animation has considered the development of online workflows based on speech corpus assets that would provide a single point of origin for character animation in media production. In this paper, a brief description of the corpus design and construction is given. Further, a prototype workflow for semi-automated emotional character animation is also provided, alongside a description of current and future work.

Adaptación del CTH-URL para la Competición Albayzin 2008
Paper (pdf, Spanish language)
C. Monzo, L. Formiga, J. Adell, I. Iriondo, F. Alías, J. Socoró (URL)

V Jornadas en Tecnología del Habla (JTH2008) – ALBAYZIN-08 System Evaluation Proposal (November 2008, Bilbao, Spain)

In this work we describe the text-to-speech synthesis system submitted to the Albayzin 2008 evaluation. The system follows the classic corpus-based unit concatenation scheme. The selection costs were adjusted by means of a genetic-algorithm-based method, and no prosody prediction was used. Two systems with different waveform generation algorithms were built, and one of them was selected on the basis of a perceptual test.

Procedimiento para la Medida y la Modificación del Jitter y del Shimmer Aplicado a la Síntesis del Habla Expresiva
Paper (pdf, Spanish language)
C. Monzo, I. Iriondo, E. Martínez (URL)

V Jornadas en Tecnología del Habla (JTH2008) (November 2008, Bilbao, Spain)

This work presents a new procedure for measuring the voice quality (VoQ) parameters jitter and shimmer. The procedure takes into account the prosody of the sentence, reducing its effect before each parameter is measured. In addition, so that the measurements can be used more reliably, these parameters are modified for use in expressive speech synthesis. Finally, an evaluation is performed using a CMOS perceptual test on four expressive styles (aggressive, happy, sensual and sad), with sentences generated by a text-to-speech synthesis system using a prosody modelling module; in this way the utility of these parameters in different situations is studied.
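As a rough illustration of the quantities being measured (not the authors' prosody-compensated procedure, which is the paper's actual contribution), local jitter and shimmer can be computed from consecutive pitch periods and peak amplitudes:

```python
def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, relative to the mean period (usually reported as a percentage)."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Local shimmer: the same measure applied to consecutive peak amplitudes."""
    diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# A perfectly periodic voice has zero jitter; small cycle-to-cycle
# perturbations of the pitch period raise it.
steady = [0.010] * 5                              # pitch periods, in seconds
perturbed = [0.010, 0.011, 0.010, 0.011, 0.010]
```

The point of the paper is that prosodic pitch movement inflates these raw measures, which is why the proposed procedure reduces the prosodic contribution before measuring.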

Pitching to Partner: How to Match Industry and Research Needs
Keynote Presentation
M. Matthews, J. Webb (BLITZ)

CGames 2008 - 13th International Conference on Computer Games (November 2008, Wolverhampton, United Kingdom)

This keynote presentation opened a 2-day conference aimed at stimulating debate about, and sharing advances in, computer games technologies. The event also aimed to help researchers refine their ideas and find new avenues for further exploration.
Mary Matthews gave an overview of the business of making games and of where R&D sits within current business models, explaining that it brings both benefits and risks: it can help a company innovate, but servicing a project may also constrain the company’s resources and starve its core business. She gave pointers on structuring proposals for academic partners wishing to engage industry in research projects, and used SALERO as an example of a successful collaboration.
Jolyon Webb gave a technical and artistic overview of SALERO R&D and a live demonstration of procedural generation of characters. He explained the industry drivers behind the adoption of intelligent content creation techniques and showed how SALERO is enabling a ‘Work Smarter, Not Harder’ approach.

Video Redundancy Detection In Rushes Collection
Paper (pdf)
R. Ren, P. Punitha, J. Jose (UG)

ACM Multimedia 2008 (October 2008, Vancouver, Canada)

Rushes are collections of raw, unedited video material. They contain various redundancies, such as rainbow screens, clipboard shots, white/black frames, and unnecessary re-takes. This paper develops a set of solutions to remove these video redundancies, as well as an effective system for video summarisation. We regard manual editing effects, e.g. clipboard shots, as differentiators in the visual language. A rushes video is therefore divided into a group of subsequences, each of which stands for a re-take instance. A graph matching algorithm is proposed to estimate the similarity between re-takes and to suggest the best instance for content presentation. The experiments on the Rushes 2008 collection show that a video can be shortened to 4%-16% of its original size by redundancy detection. This significantly reduces the complexity of content selection and leads to an effective and efficient video summarisation system.

FacetBrowser: a user interface for complex search tasks
Paper (for purchase)
R. Villa, N. Gildea, J. Jose (UG)

ACM Multimedia 2008 (October 2008, Vancouver, Canada)

With the rapid increase in online video services, multimedia retrieval systems are becoming increasingly important search tools to users in many different fields. In this paper we present a novel retrieval interface, "FacetBrowser", which supports the creation of multiple search "facets", to aid users carrying out complex search tasks involving multiple concepts. Each facet represents a different aspect of the search task: an assumption of this work is that search facets are best represented by sub-searches, providing users with flexibility in defining facets on the fly, rather than using pre-defined categories or metadata information as used in many other exploratory search interfaces. Such facets can be organised into "stories" by users, facilitating users in building up sequences of related searches and material which together can be used to satisfy a work task. The interface allows more than one search to be executed and viewed simultaneously, and importantly, allows material to be reorganized between the facets, acknowledging the inter-relatedness which can often occur between search facets. The design of the FacetBrowser interface is presented, along with an experiment comparing it to a tabbed interface similar to that on modern web browsers. The results suggest that the FacetBrowser has the potential to aid users in exploring and structuring their searching effort when carrying out broad search tasks.

Collaborative awareness in multimedia search
Paper (for purchase)
R. Villa, N. Gildea, J. Jose (UG)

ACM Multimedia 2008 (October 2008, Vancouver, Canada)

Awareness of another's activity is an important aspect of facilitating collaboration between users, enabling an "understanding of the activities of others". Many environments where multimedia search is required are collaborative in nature, such as when a group of artists and animators is engaged in the production of a multimedia product. In this paper we introduce a study which used a novel evaluation methodology, where pairs of users competed to find the most relevant shots for a topic, with the aim of evaluating the role of awareness within an environment conducive to its use. Results based on event-log and questionnaire data are reported, and conclusions are presented illustrating some of the issues and pitfalls of using awareness in a video search system.

Metadata Visualisation Techniques for Emotional Speech Corpora
Paper (pdf)
C. Cullen, B. Vaughan, S. Kousidis, J. McAuley (DIT)

Second International Workshop on Adaptive Information Retrieval (AIR 2008) (October 2008, London, United Kingdom)

Our research in emotional speech analysis has led to the construction of dedicated high quality, online corpora of natural emotional speech assets. Once obtained, the annotation and analysis of these assets was necessary in order to develop a database of both analysis data and metadata relating to each speech act. With annotation complete, the means by which this data may be presented to the user online for analysis, retrieval and organization is the current focus of our investigations. Building on an initial web interface developed in Ruby on Rails, we are now working towards a visually driven GUI built on Adobe Flex. This paper details our work towards this goal, defining the rationale behind development and also demonstrating work achieved to date.

New Media: a narrative approach to content annotation
Paper (pdf)
J. McAuley, C. Cullen (DIT)

Irish Media Research Network National Conference (IMRN 2008) (September 2008, Maynooth, Ireland)

Recent years have seen an upsurge in the popularity of user-generated content. Sites such as Youtube and Flickr have illustrated that increasing numbers of web users are willing to publicly share their content, while equally sites such as Delicious and Blinklist demonstrate that growing numbers of users are willing to annotate each other’s content. Annotation, in this context, comes primarily under the guise of social tagging whereby users apply labels to resources in a subjective yet non-restrictive approach to subject-based indexing.

Towards measuring continuous acoustic feature convergence in unconstrained spoken dialogues
Paper (pdf)
S. Kousidis, D. Dorran, B. Vaughan, C. Cullen (DIT)

Interspeech 2008 (September 2008, Brisbane, Australia)

Acoustic/prosodic (a/p) feature convergence has been known to occur both in dialogues between humans and in human-computer interactions. Understanding the form and function of convergence is desirable for developing next-generation conversational agents, as this will help increase speech recognition performance and the naturalness of synthesized speech. Currently, the underlying mechanisms by which continuous and bi-directional convergence occurs are not well understood. In this study, a direct comparison between time-aligned frames shows significant similarity in acoustic feature variation between the two speakers. The method described (TAMA) constitutes a first step towards a quantitative analysis of a/p convergence.
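The time-aligned comparison behind a method like the one described can be sketched as follows: average each speaker's feature track (e.g. pitch) over fixed-length, overlapping frames, then correlate the two frame series. This is a hedged sketch of the general approach, not the authors' exact method; the frame length, hop, and feature choice here are hypothetical.

```python
import math

def frame_average(track, frame_len, hop):
    """Moving average of a per-sample feature track over fixed-length,
    overlapping analysis frames (the time-aligned averaging step)."""
    out, i = [], 0
    while i + frame_len <= len(track):
        out.append(sum(track[i:i + frame_len]) / frame_len)
        i += hop
    return out

def pearson(x, y):
    """Pearson correlation between two equal-length frame series; values
    near +1 suggest the speakers' feature variation moves together."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Tracking this correlation over successive portions of a dialogue gives a continuous, quantitative indicator of whether the speakers are converging or diverging.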

Wide-Band Harmonic Sinusoidal Modeling
Paper (pdf)
J. Bonada (UPF)

DAFx-08 - 11th International Conference on Digital Audio Effects (September 2008, Espoo, Finland)

In this paper we propose a method to estimate and transform harmonic components in wide-band conditions, out of a single period of the analyzed signal. This method allows harmonic parameters to be estimated with higher temporal resolution than typical Short Time Fourier Transform (STFT) based methods. We also discuss transformation and synthesis strategies in this context, focusing on the human voice.
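The core idea of estimating harmonic parameters from a single period can be illustrated with a plain DFT: when exactly one pitch period is analysed, DFT bin k corresponds directly to the k-th harmonic. This is only a toy sketch of that principle, not the paper's wide-band method:

```python
import cmath
import math

def harmonics_from_one_period(period):
    """Amplitude and phase of each harmonic, estimated from the samples
    of a single pitch period: DFT bin k <-> k-th harmonic."""
    N = len(period)
    params = []
    for k in range(1, N // 2):
        c = sum(period[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
        params.append((2.0 * abs(c) / N, cmath.phase(c)))
    return params

# One period of a pure sine: all energy should land in the first harmonic.
N = 64
one_period = [math.sin(2 * math.pi * n / N) for n in range(N)]
params = harmonics_from_one_period(one_period)
```

Because only one period of signal is needed, the analysis window is as short as the pitch period itself, which is what gives the higher temporal resolution compared with multi-period STFT windows.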

The Need for Formalizing Media Semantics in the Games and Entertainment Industry
Paper (pdf)
T. Bürger (LFUI)

Journal of Universal Computer Science (August 2008, Volume 14, Issue 10)

The digital media and games industry is one of the biggest IT-based industries worldwide. Recent observations show that current production workflows could be improved, as multimedia objects are mostly created from scratch due to the insufficient reuse capabilities of existing tools. In this paper we give reasons for this, present a potential solution based on semantic technologies, show the potential of ontologies, and describe scenarios for the application of semantic technologies in the digital media and games industry.

Extending Voice-Driven Synthesis to Audio Mosaicing
Paper (pdf)
J. Janer, M. de Boer (UPF)

5th Sound and Music Computing Conference (August 2008, Berlin, Germany)

This paper presents a system for controlling audio mosaicing with a voice signal, which can be interpreted as a further step in voice-driven sound synthesis. Compared to voice-driven instrumental synthesis, it increases the variety of the synthesized timbre. It also provides a more direct interface for audio mosaicing applications, where the performer's voice controls rhythmic, tonal and timbral properties of the output sound. In a first step, the voice signal is segmented into syllables, and a set of acoustic features is extracted for each segment. In the concatenative synthesis process, the voice acoustic features (the target) are used to retrieve the most similar segment from the corpus of audio sources. We implemented a system working in pseudo-realtime, which analyzes the voice input and sends control messages to the concatenative synthesis module. Additionally, this work raises questions, to be explored further, about mapping the input voice timbre space onto the timbre space of the audio sources.

Temporal Attention Fusion For Sports Event Detection
Paper (pdf)
R. Ren, Y. Feng, J. Jose (UG)

The 5th International Conference on Visual Information Engineering (August 2008, Xi'an, China)

Employing a psychological measurement, attention, alleviates the semantic uncertainty around video events and leads to an effective general event detection approach. This paper proposes a multi-resolution autoregressive framework to estimate a unified attention curve from multi-modality salient features at different temporal resolutions. The highlights of this work are: (1) the capability of using data at very coarse temporal resolutions, e.g. three minutes; (2) the robustness against noise caused by modality asynchronism and feature collection size; and (3) the utilisation of Markovian temporal constraints on content presentation. This approach achieved 100% goal event coverage on the football video collections of the FIFA World Cup 2002 and 2006 and the UEFA League 2006.

Rule-Based Scene Boundary Detection for Semantic Video Segmentation
Paper (pdf)
Y. Feng, R. Ren, J. Jose (UG)

The 5th International Conference on Visual Information Engineering (August 2008, Xi'an, China)

In this paper, we present a novel method for semantic video segmentation that uses both low-level features and high-level rules, managing the video in a hierarchical structure of key-frames, shots and scenes. Features in the color domain are calculated and utilized for detecting the key-frames and estimating the similarity between shots. By applying the predefined high-level rules, similar shots are merged and the scene boundaries are determined. Finally, a likelihood function is designed to improve the accuracy of the scene boundary results. Experimental results on several Hollywood movies show that better precision and recall are achieved compared with other existing works.

A faceted interface for multimedia search
Paper (for purchase)
R. Villa, N. Gildea, J. Jose (UG)

SIGIR 2008 - 31st Annual international ACM SIGIR Conference on Research and Development in information Retrieval (July 2008, Singapore, Singapore)

With the rapid increase in online video services, video retrieval systems are becoming increasingly important search tools to many users in many different fields. In this poster we present a novel video retrieval interface, which supports the creation of multiple search "facets", to aid users carrying out complex, multi-faceted search tasks. The interface allows multiple searches to be executed and viewed simultaneously, and allows material to be reorganized between the facets. An experiment is presented which compares the faceted interface to a tabbed interface similar to that on modern web browsers, and some preliminary results are given.

A cluster-based simulation of facet-based search
Paper (for purchase)
T. Urruty, F. Hopfgartner, R. Villa, N. Gildea, J. Jose (UG)

8th ACM/IEEE-CS Joint Conference on Digital Libraries (June 2008, Pittsburgh, USA)

The recent increase of online video has challenged research in the field of video information retrieval. Video search engines are becoming more and more interactive, helping the user easily find what he or she is looking for. In this poster, we present a new approach that uses an iterative clustering algorithm on text and visual features to simulate users creating new facets in a facet-based interface. Our experimental results demonstrate the usefulness of this approach.

Exploiting log files in video retrieval
Paper (for purchase)
F. Hopfgartner, T. Urruty, R. Villa, N. Gildea, J. Jose (UG)

8th ACM/IEEE-CS Joint Conference on Digital Libraries (June 2008, Pittsburgh, USA)

While research into user-centred text retrieval is based on mature evaluation methodologies, user evaluation in multimedia retrieval is still in its infancy. User evaluations can be expensive and are also often non-repeatable. An alternative way of evaluating such systems is the use of simulations. In this poster, we present an evaluation methodology which is based on exploiting log files recorded from a user-study we conducted.

A study of awareness in multimedia search
Paper (for purchase)
R. Villa, N. Gildea, J. Jose (UG)

8th ACM/IEEE-CS Joint Conference on Digital Libraries (June 2008, Pittsburgh, USA)

Awareness of another's activity is an important aspect of facilitating collaboration between users, enabling an "understanding of the activities of others". Techniques such as collaborative filtering enable a form of asynchronous awareness, providing recommendations generated from the past activity of a community of users. In this paper we investigate the role of awareness and its effect on search behavior in collaborative multimedia retrieval. We focus on the scenario where two users are searching at the same time on the same task and, via the interface, can see the activity of the other user. The main research question asks: does awareness of another searcher aid a user when carrying out a multimedia search session? To encourage awareness, an experimental study was designed where two users were asked to find as many relevant video shots as possible under different awareness conditions. These were individual search (no awareness of each other), mutual awareness (where both users could see each other's search screen), and unbalanced awareness (where one user is able to see the other's screen, but not vice versa). Twelve pairs of users were recruited, and the four worst-performing TRECVID 2006 search topics were used as search tasks under the different awareness conditions. We present the results of this study, followed by a discussion of the implications for multimedia digital library systems.

A User Centered Annotation Methodology for Multimedia Content
Paper (pdf)
T. Bürger, C. Ammendola (LFUI)

5th European Semantic Web Conference (June 2008, Tenerife, Spain)

Fully automated solutions for the semantic annotation of multimedia content still do not deliver satisfying results. Most manual ontology-based annotation approaches are not suitable for end users who are not experienced in navigating huge ontologies or in extending the ontologies used for annotation. We therefore present an annotation methodology which supports the user in these tasks. This lowers the entry barrier for non-experienced users to produce ontology-based annotations, and could thus be used in situations in which annotation should happen just-in-time during the creation of the media being annotated.

A Benefit Estimation Model for Ontologies
Paper (pdf)
T. Bürger (LFUI)

5th European Semantic Web Conference (June 2008, Tenerife, Spain)

Predicting the economic value of ontologies is important for their use in productive environments. The measurement of the economic value of information systems usually consists of an assessment of its costs and benefits. While methods for cost estimation for ontology engineering have already been proposed, no method to quantify the benefits of the use of ontologies exists. We thus propose a method for benefit estimation that can be applied to ontologies based on a multiple gap model for user information satisfaction analysis. Together with cost estimation methods this model can be used to predict the economic value of ontologies.

Emotional speech corpus construction, annotation and distribution
Paper (pdf)
C. Cullen, B. Vaughan, S. Kousidis (DIT)

Language Resources and Evaluation Conference (LREC 2008) (May 2008, Marrakech, Morocco)

This paper details a process of creating an emotional speech corpus by collecting natural emotional speech assets, analysing and tagging them (for certain acoustic and linguistic features) and annotating them within an on-line database. The definition of specific metadata for use with an emotional speech corpus is crucial, in that poorly (or inaccurately) annotated assets are of little use in analysis. This problem is compounded by the lack of standardisation for speech corpora, particularly in relation to emotional content. The ISLE Metadata Initiative (IMDI) is the only cohesive attempt at corpus metadata standardisation performed thus far. Although not a comprehensive (or universally adopted) standard, IMDI represents the only current standard for speech corpus metadata available. Adopting the IMDI standard allows the corpus to be re-used and expanded in a clear and structured manner, ensuring its usefulness as well as addressing issues of data sparsity within the field of emotional speech research.

Semantic Relationships in Multi-modal Graphs for Automatic Image Annotation
Paper (for purchase)
V. Stathopoulos, J. Urban, J. Jose (UG)

30th European Conference on Information Retrieval (ECIR 2008) (March 2008, Glasgow, United Kingdom)

Integrating contextual information is important for improving the inaccurate results of current approaches to automatic image annotation. Graph-based representations allow such information to be incorporated; however, their behaviour has not been studied in this context. We conduct extensive experiments to show the properties of such representations, using semantic relationships as a type of contextual information. We also experiment with different similarity measures for semantic features and present the results.

Towards Intelligent Assembly of Media Assets for Automated Character Animation
Paper (pdf)
M. Hausenblas, R. Mörzinger, P. Hofmair, W. Haas (JRS)

1st Workshop on Multimedia Annotation and Retrieval enabled by Shared Ontologies (December 2007, Genova, Italy)

Creating character animations manually is an expensive and laborious task. In this work we analyse the current, manual workflow of creating character animations. We derive requirements for an automated process, and propose to utilise linked open datasets for context management, along with ontologies to assemble and reuse character animations. First experiences with the prototypical implementation of the context manager are reported.

TRECVid 2007 - High Level Feature Extraction Experiments at JOANNEUM RESEARCH
Paper (pdf)
R. Mörzinger, G. Thallinger (JRS)

TRECVid Evaluation Workshop (November 2007, Gaithersburg, USA)

This paper describes our experiments for the high level feature extraction task in TRECVid 2007. We submitted the following five runs:
  • A_jr1_1: Baseline run using early fusion of all input
  • A_jr1_2: Classic early feature fusion and concept correlation
  • A_jr1_3: Classic late feature fusion
  • A_jr1_4: Late feature fusion and concept correlation
  • A_jr1_5: Early fusion of heuristically defined feature combinations
The experiments were designed to study the performance of various content-based features in connection with classic early and late feature fusion, the influence of manually (heuristically) selected input feature combinations, and the application of concept correlation.

Our submission made use of support vector machines based on a variety of image and video features. The results of the experiments show that four out of five runs achieved a performance above the TRECVid median, including a run in which 18 of the 20 evaluated high level features scored at or above the median in inferred average precision. The mean inferred average precision of our baseline run is 0.056. Early fusion performed slightly better than late fusion on average, although the latter produced more scores above the TRECVid median. The experiment on concept correlation generally impaired performance, outscoring the baseline for only a few features. Heuristic low-level feature combinations displayed rather poor performance. We assume that the good baseline is due to the effective grounding of a variety of low-level visual features and the generalization capability of the SVM framework in high-dimensional feature spaces.
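
As an illustration of the two fusion strategies compared above, the toy sketch below contrasts early fusion (concatenate all feature vectors, classify once) with late fusion (classify per feature set, then combine the scores). A simple nearest-centroid scorer stands in for the paper's SVMs, and all feature values and labels are invented for demonstration.

```python
# Toy early vs. late feature fusion. A nearest-centroid scorer replaces
# the SVMs used in the paper; data and labels are illustrative only.

def centroid_score(train, query):
    """Per-class similarity: negative squared distance to class centroid."""
    scores = {}
    for label, vectors in train.items():
        dim = len(vectors[0])
        centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        scores[label] = -sum((q - c) ** 2 for q, c in zip(query, centroid))
    return scores

def early_fusion(feature_sets, query_parts):
    """Concatenate all features first, then classify once."""
    train = {}
    for label in feature_sets[0]:
        n = len(feature_sets[0][label])
        train[label] = [sum((fs[label][i] for fs in feature_sets), ())
                        for i in range(n)]
    return centroid_score(train, sum(query_parts, ()))

def late_fusion(feature_sets, query_parts):
    """Classify per feature set, then average the per-feature scores."""
    fused = {}
    for fs, part in zip(feature_sets, query_parts):
        for label, score in centroid_score(fs, part).items():
            fused[label] = fused.get(label, 0.0) + score / len(feature_sets)
    return fused

# Two invented feature sets (e.g. "color" and "texture") over two classes.
color = {"indoor": [(0.1,), (0.2,)], "outdoor": [(0.8,), (0.9,)]}
texture = {"indoor": [(0.3,), (0.4,)], "outdoor": [(0.7,), (0.6,)]}
early = early_fusion([color, texture], [(0.15,), (0.35,)])
late = late_fusion([color, texture], [(0.15,), (0.35,)])
```

Here both strategies agree on the query's class; the paper's point is that with many real features and an SVM, the two strategies can rank differently.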

Glasgow University at TRECVid 2007
Paper (pdf)
R. Mörzinger, G. Thallinger (JRS)

TRECVid Evaluation Workshop (November 2007, Gaithersburg, USA)

In this paper we describe our experiments in the automatic search task of TRECVid 2007. For this we have implemented a new video search technique based on SIFT features and manual annotation. We submitted two runs, one solely based on the SIFT features with keyframe matching and the other based on adapted SIFT features for video retrieval in addition to manually annotated data.

Why Real-World Multimedia Assets Fail to Enter the Semantic Web
Paper (pdf), Presentation (pdf)
T. Bürger (LFUI), M. Hausenblas (JRS)

Semantic Authoring, Annotation and Knowledge Markup Workshop (October 2007, Whistler, Canada)

Making multimedia assets first-class objects on the Semantic Web, while keeping them conformant to existing multimedia standards, is a non-trivial task. Most proprietary media asset formats are binary, optimized for streaming or storage, so the semantics carried by the media assets are not directly accessible. In addition, multimedia description standards lack the expressiveness needed to gain a semantic understanding of the media assets.

An array of requirements regarding media assets and the Semantic Web already exists. Based on a critical review of these requirements, we investigate how ontology languages fit into the picture. Finally, we analyse the usefulness of formal accounts for describing spatio-temporal aspects of multimedia assets in a practical context.

The Need for Formalizing Media Semantics in the Games and Entertainment Industry
Paper (pdf)
T. Bürger (LFUI), H. Zeiner (JRS)

I-MEDIA '07 - 1st International Conference on New Media Technology (September 2007, Graz, Austria)

The digital media and games industry is one of the biggest IT-based industries worldwide. Recent observations show that current production workflows could be improved: multimedia objects are mostly created from scratch because existing tools offer insufficient support for reuse. In this paper we provide reasons for this, outline a potential solution based on semantic technologies, show the potential of ontologies, and describe scenarios for the application of semantic technologies in the digital media and games industry.

Annotating Music Collections: How content-based similarity helps to propagate labels
Paper (pdf)
M. Sordo, C. Laurier, O. Celma (UPF)

ISMIR 2007 - 8th International Conference on Music Information Retrieval (September 2007, Vienna, Austria)

In this paper we present a way to annotate music collections by exploiting audio similarity. Similarity is used to propose labels (tags) to yet unlabeled songs, based on the content-based distance between them. The main goal of our work is to ease the process of annotating huge music collections, by using content-based similarity distances as a way to propagate labels among songs.

We present two different experiments. The first one propagates labels that are related with the style of the piece, whereas the second experiment deals with mood labels. On the one hand, our approach shows that using a music collection annotated at 40% with styles, the collection can be automatically annotated up to 78% (that is, 40% already annotated and the rest, 38%, only using propagation), with a recall greater than 0.4. On the other hand, for a smaller music collection annotated at 30% with moods, the collection can be automatically annotated up to 65% (i.e. 30% plus 35% using propagation).
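
The propagation step described above can be sketched as a nearest-neighbour vote: an unlabeled song inherits a tag when its closest annotated songs agree and lie close enough in feature space. This is a hypothetical simplification, assuming Euclidean distance over toy two-dimensional feature vectors; the paper's actual content-based similarity measure and thresholds are not reproduced here.

```python
# Toy content-based label propagation: unlabeled songs inherit the tag
# of their nearest labeled neighbours when those neighbours agree and
# are within a distance threshold. Data and distances are invented.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def propagate_labels(labeled, unlabeled, k=2, max_dist=0.3):
    """labeled: {song: (features, tag)}; unlabeled: {song: features}.
    Returns proposed tags for songs whose k nearest labeled neighbours
    agree and lie within max_dist; other songs stay unannotated."""
    proposals = {}
    for song, feats in unlabeled.items():
        neighbours = sorted(labeled.values(),
                            key=lambda ft: euclidean(feats, ft[0]))[:k]
        close = [tag for fv, tag in neighbours
                 if euclidean(feats, fv) <= max_dist]
        if close and all(t == close[0] for t in close):
            proposals[song] = close[0]
    return proposals

labeled = {"a": ((0.1, 0.2), "jazz"), "b": ((0.15, 0.25), "jazz"),
           "c": ((0.9, 0.8), "rock")}
unlabeled = {"x": (0.12, 0.22), "y": (0.95, 0.85), "z": (0.5, 3.0)}
proposals = propagate_labels(labeled, unlabeled, k=2, max_dist=0.3)
```

Song "z" sits far from every labeled song, so it receives no proposal, mirroring how propagation only extends coverage where similarity evidence is strong.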

Discriminating Expressive Speech Styles by Voice Quality Parameterization
Paper (pdf)
C. Monzo, F. Alías, I. Iriondo, X. Gonzalvo, S. Planet (URL)

ICPhS07 - International Congress of Phonetic Sciences (August 2007, Saarbrücken, Germany)

In this work, the capability of voice quality parameters to discriminate among different expressive speech styles is analyzed. To that effect, the data distribution of these parameters, directly measured from the acoustic speech signal, is used to train a Linear Discriminant Analysis that conducts an automatic classification. As a result, the most relevant voice quality patterns for discriminating expressive speech styles are obtained for a diphone and triphone Spanish speech corpus with five expressive speaking styles: neutral, happy, sad, sensual and aggressive.

Expressive Speech Corpus Validation by Mapping Subjective Perception to Automatic Classification Based on Prosody and Voice Quality
Paper (pdf)
I. Iriondo, S. Planet, F. Alías, J.C. Socoró, C. Monzo, E. Martínez (URL)

ICPhS07 - International Congress of Phonetic Sciences (August 2007, Saarbrücken, Germany)

This paper presents the validation of the expressive content of an acted corpus produced for use in speech synthesis, since this kind of emotional speech can be rather lacking in authenticity. The goal is to obtain an automatic classifier able to prune the bad utterances from an expressiveness point of view. The results of a previous subjective test are used to train a multistage emotion identification system based on statistical features computed from the speech prosody and voice quality. Finally, the system provides a set of utterances to be checked and, if appropriate, definitively eliminated.

Task-Based Mood Induction Procedures for the Elicitation of Natural Emotional Responses
B. Vaughan, S. Kousidis, and Ch. Cullen (DIT)

CCCT 2007 - The 5th International Conference on Computing, Communications and Control Technologies (July 2007, Orlando, USA)

Validation of an Expressive Speech Corpus by Mapping Automatic Classification to Subjective Evaluation
Book chapter (from Springer)
I. Iriondo, S. Planet, J.C. Socoró, F. Alías, E. Martínez (URL)

IWANN 2007 - 9th International Work-Conference on Artificial Neural Networks (June 2007, San Sebastián, Spain)

This paper presents the validation of the expressive content of an acted corpus produced for use in speech synthesis. Acted speech can be rather lacking in authenticity, and therefore validation of its expressiveness is required. The goal is to obtain an automatic classifier able to prune the bad utterances (those with wrong expressiveness). Firstly, a subjective test was conducted with almost ten percent of the corpus utterances. Secondly, objective techniques were applied by means of automatic identification of emotions, using different algorithms applied to statistical features computed over the speech prosody. The relationship between the two evaluations is established by an attribute selection process guided by a metric that measures the match between the utterances misclassified by the users and by the automatic process. The experiments show that this approach can be useful for providing a subset of utterances with poor or wrong expressive content.

Extracting User Preferences by GTM for aiGA Weight Tuning in Unit Selection Text-to-Speech Synthesis
Book chapter (from Springer)
Ll. Formiga, F. Alías (URL)

IWANN 2007 - 9th International Work-Conference on Artificial Neural Networks (June 2007, San Sebastián, Spain)

Unit-selection based Text-to-Speech synthesis systems aim to obtain high quality synthetic speech by selecting previously recorded units. These units are selected by a dynamic programming algorithm guided by a weighted cost function. The weights should be tuned from listeners' perception in order to obtain proper quality. In previous works we proposed to tune these weights subjectively through an interactive evolutionary process, also known as an Active Interactive Genetic Algorithm (aiGA). The problem arises when different users, although individually consistent, evolve to different weight configurations. In this proof-of-principle work, we introduce GTM as a method to extract knowledge from user-specific preferences. The experiments show that GTM is able to capture user preferences, thus avoiding the need to select the best evolved weight configuration by means of a new preference test.

Enhancing CBIR Through Feature Optimization, Combination and Selection
Paper (pdf, available to IEEE subscribers)
X. Hilaire, J. Jose (UG)

CBMI 2007. International Workshop on Content-Based Multimedia Indexing (June 2007, Bordeaux, France)

We present a Content-Based Image Retrieval (CBIR) method based on the combination and selection of several image features. The novelty of our approach over existing methods is threefold: we provide a statistical optimization of the similarity distance for each feature; we replace certain features by a selection in a non-linear expansion of them; and we perform a linear combination of the features. We demonstrate superior capabilities of our method in certain cases over support vector machines (SVM) on a COREL image collection.

Simulated testing of an adaptive multimedia information retrieval system
Paper (pdf)
F. Hopfgartner, J. Urban, R. Villa, J. Jose (UG)

CBMI 2007. International Workshop on Content-Based Multimedia Indexing (June 2007, Bordeaux, France)

The semantic gap is considered to be a bottleneck in image and video retrieval. One way to increase the communication between user and system is to take advantage of the user's interactions with the system, e.g. to infer the relevance, or otherwise, of a video shot viewed by the user. In this paper we introduce a novel video retrieval system and propose a model of implicit information for interpreting the user's actions with the interface. The assumptions on which this model was created are then analysed in an experiment using simulated users, based on relevance judgements, to compare the results of explicit and implicit retrieval cycles. Our model appears to enhance retrieval results. Results are presented and discussed in the final section.
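
The idea of inferring relevance from interface actions can be sketched as weighted evidence accumulation: each action a user takes on a shot contributes positive or negative evidence of relevance. The action names and weights below are invented assumptions for illustration, not the model evaluated in the paper.

```python
# Toy implicit-feedback model: user actions on video shots accumulate
# weighted relevance evidence. Actions and weights are hypothetical.
ACTION_WEIGHTS = {
    "play": 1.0,              # watching a shot suggests interest
    "browse_keyframes": 0.5,  # lighter evidence of interest
    "mark_relevant": 2.0,     # strong (near-explicit) positive action
    "skip": -0.5,             # negative evidence
}

def rank_by_implicit_evidence(action_log):
    """action_log: list of (shot_id, action) pairs observed during a
    search session. Returns shot ids ranked by accumulated evidence."""
    scores = {}
    for shot, action in action_log:
        scores[shot] = scores.get(shot, 0.0) + ACTION_WEIGHTS.get(action, 0.0)
    return sorted(scores, key=lambda s: scores[s], reverse=True)

log = [("shot1", "play"), ("shot1", "mark_relevant"),
       ("shot2", "browse_keyframes"), ("shot3", "skip")]
ranking = rank_by_implicit_evidence(log)
```

A ranking like this could then feed the next retrieval cycle as pseudo relevance feedback, which is the kind of implicit cycle the simulated-user experiment compares against explicit feedback.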

HMM-Based Spanish Speech Synthesis Using CBR as F0 Estimator
Paper (pdf)
X. Gonzalvo, I. Iriondo, J.C. Socoró, F. Alías, C. Monzo (URL)

NOLISP 2007 - An ISCA Tutorial and Research Workshop on NOn LInear Speech Processing (May 2007, Paris, France)

Hidden Markov Model based text-to-speech (HMM-TTS) synthesis is a technique for generating speech from trained statistical models in which the spectrum, pitch and durations of basic speech units are modelled together. The aim of this work is to describe a Spanish HMM-TTS system using CBR as an F0 estimator, analysing its performance objectively and subjectively. The experiments were conducted on a reliably labelled speech corpus, whose units were clustered using contextual factors appropriate to the Spanish language. The results show that CBR-based F0 estimation improves on the HMM-based baseline when synthesizing short non-declarative sentences and when only reduced contextual information is available.

Objective and Subjective Evaluation of an Expressive Speech Corpus
I. Iriondo, S. Planet, J.C. Socoró, F. Alías (URL)

NOLISP 2007 - An ISCA Tutorial and Research Workshop on NOn LInear Speech Processing (May 2007, Paris, France)

This paper presents the validation of the expressiveness of an acted oral corpus produced to be used in speech synthesis. Firstly, an objective validation has been conducted by means of automatic emotion identification techniques using statistical features extracted from the prosodic parameters of speech. Secondly, a listening test has been performed with a subset of utterances. The relationship between both objective and subjective evaluations is analyzed and the obtained conclusions can be useful to improve the following steps related to expressive speech synthesis.

VAMP: Semantic Validation for MPEG-7 Profile Descriptions
Technical Report (pdf)
R. Troncy (Centrum voor Wiskunde en Informatica), W. Bailer, M. Hausenblas, M. Höffernig (JRS)

Technical report published by Centrum voor Wiskunde en Informatica, INS - Information Systems (April 2007, Amsterdam, Netherlands)

MPEG-7 can be used to create complex and comprehensive metadata descriptions of multimedia content. Since MPEG-7 is defined in terms of an XML schema, the semantics of its elements have no formal grounding. In addition, certain features can be described in multiple ways. MPEG-7 profiles are subsets of the standard that apply to specific application areas and aim to reduce this syntactic variability, but they still lack formal semantics. We propose an approach for making the semantics explicit by formalizing the constraints of various profiles using ontologies and logical rules, thus enabling interoperability and automatic use in MPEG-7 based applications. We have implemented VAMP, a full semantic validation service that detects inconsistencies with respect to the formalized semantic constraints. Another contribution of this paper is an analysis of how MPEG-7 is used in practice. We report on experiments on the semantic validity of MPEG-7 descriptions produced by numerous tools and projects, and we categorize the most common errors found.

Prosody Modelling of Spanish for Expressive Speech Synthesis
I. Iriondo, J.C. Socoró, F. Alías (URL)

ICASSP'07 - International Conference on Acoustic, Speech, and Signal Processing (April 2007, Hawaii, USA)

This paper presents the use of analogical learning, in particular case-based reasoning, for the automatic generation of prosody from text, which is automatically tagged with prosodic features. This is a corpus-based method for quantitative modelling of prosody to be used in a Spanish text to speech system. The main objective is the development of a method for predicting the three main prosodic parameters: the fundamental frequency (F0) contour, the segmental duration and energy. Both objective and subjective experiments have been conducted in order to evaluate the accuracy of our proposal.

Content-Based Audio Search: From Fingerprinting to Semantic Audio Retrieval
Dissertation (pdf)
P. Cano (UPF)

Dissertation at the Pompeu Fabra University (2007, Barcelona, Spain)

This dissertation is about audio content-based search. Specifically, it explores promising paths for bridging the semantic gap that currently prevents wide deployment of audio content-based search engines. Music and sound search engines rely on metadata, mostly human generated, to manage collections of audio assets. Even though it is time-consuming and error-prone, human labeling is a common practice. Audio content-based methods, algorithms that automatically extract descriptions from audio files, are generally not mature enough to provide the user-friendly representation that users demand when interacting with audio content. Mostly, content-based methods provide low-level descriptions, while high-level or semantic descriptions are beyond current capabilities.

Spectral Processing of the Singing Voice
Dissertation (pdf)
A. Loscos (UPF)

Dissertation at the Pompeu Fabra University (2007, Barcelona, Spain)

This dissertation is centered on the digital processing of the singing voice, more concretely on the analysis, transformation and synthesis of this type of voice in the spectral domain, with special emphasis on those techniques relevant for music applications.

The digital signal processing of the singing voice has been a research topic in its own right since the middle of the last century, when the first synthetic singing performances were generated by taking advantage of research being carried out in the speech processing field. Even though the two topics overlap in some areas, they differ significantly because of (a) the special characteristics of the sound source they deal with and (b) the applications that can be built around them. More concretely, while speech research concentrates mainly on recognition and synthesis, singing voice research, probably due to the consolidation of a forceful music industry, focuses on experimentation and transformation, developing countless tools that over the years have assisted and inspired the most popular singers, musicians and producers. The compilation and description of the existing tools and the algorithms behind them are the starting point of this thesis.

SALERO: Semantic Audiovisual Entertainment Reusable Objects
Paper (pdf), Poster (pdf)
W. Haas, G. Thallinger (JRS), P. Cano (UPF), Ch. Cullen (DIT), T. Bürger (LFUI)

1st International Conference on Semantic and Digital Media Technologies - SAMT 2006 (December 2006, Athens, Greece)

The Integrated Project SALERO aims to advance the state of the art in digital media to the point where it becomes possible to create audiovisual content for cross-platform delivery using intelligent content tools, with greater quality at lower cost, to provide audiences with more engaging entertainment and information at home or on the move. SALERO will build on and extend research in media technologies, web semantics and context based image retrieval, to reverse the trend toward ever-increasing cost of creating media.

Modelado y estimación de la prosodia mediante razonamiento basado en casos (Modelling and Estimation of Prosody by Means of Case-Based Reasoning)
Paper (pdf, Spanish language)
I. Iriondo, J.C. Socoró, L. Formiga, X. Gonzalvo, F. Alías, P. Miralles (URL)

IV Jornadas en Tecnología del Habla (November 2006, Zaragoza, Spain)

This paper presents the use of analogical learning, in particular case-based reasoning, for the automatic generation of prosody from text, which is automatically tagged with prosodic features. This is a corpus-based method for quantitative modeling of prosody to be used in a Spanish text to speech system. The main objective is the development of a method for predicting the three main prosodic parameters: the fundamental frequency (F0) contour, the segmental duration and energy. Both objective and subjective experiments have been conducted in order to evaluate the accuracy of our proposal.

Estudio de Heurísticas para la implementación de A* en CTH basados en selección de unidades (Heuristics for Implementing the A* Algorithm for Unit Selection TTS Synthesis Systems)
Paper (pdf, Spanish language)
L. Formiga, F. Alías (URL)

IV Jornadas en Tecnología del Habla (November 2006, Zaragoza, Spain)

Unit Selection based Text to Speech systems (USTTS) need to perform an optimal search for units in a speech corpus in order to obtain high-quality synthesis. Until now, this search has typically been carried out with the Viterbi algorithm. Our work replaces it with the A* algorithm to improve computational efficiency. To that end, we first review previous work attempting this substitution; we then define a benchmark to measure efficiency and analyse the results to validate, as a final step, the theoretical argumentation.
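
The Viterbi-to-A* substitution discussed above can be illustrated with a minimal lattice search. The sketch below runs A* over per-position candidate units, guided by an admissible heuristic (the cheapest possible target cost of the remaining positions), so the first complete path popped from the queue is optimal. Unit names, costs and the join rule are invented for illustration; real systems combine many weighted target and concatenation sub-costs.

```python
# Toy A* unit selection over a lattice of candidate units per position.
# The heuristic never overestimates (it ignores join costs), so A* is
# guaranteed to return the same optimum a full Viterbi search would.
import heapq

def a_star_units(target_costs, concat_cost):
    """target_costs: list of {unit: cost}, one dict per target position.
    concat_cost(u, v): cost of joining unit u to unit v.
    Returns (best unit sequence, total cost)."""
    n = len(target_costs)
    # admissible heuristic: cheapest remaining target cost, ignoring joins
    tail = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        tail[i] = tail[i + 1] + min(target_costs[i].values())
    frontier = [(tail[0], 0.0, ())]  # entries: (f = g + h, g, partial path)
    while frontier:
        f, g, path = heapq.heappop(frontier)
        i = len(path)
        if i == n:                   # first complete path is optimal
            return list(path), g
        for unit, tcost in target_costs[i].items():
            join = concat_cost(path[-1], unit) if path else 0.0
            g2 = g + tcost + join
            heapq.heappush(frontier, (g2 + tail[i + 1], g2, path + (unit,)))
    return None

# Invented candidates: two units per position; units from the same
# recording (same digit) join for free, otherwise pay a join penalty.
target_costs = [{"a1": 0.1, "a2": 0.5}, {"b1": 0.3, "b2": 0.2}]
concat_cost = lambda u, v: 0.0 if u[1] == v[1] else 0.4
sequence, total = a_star_units(target_costs, concat_cost)
```

The efficiency argument in the paper is that A* can stop as soon as a complete path is popped, whereas Viterbi always expands the whole lattice.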

Generation of High Quality Audio Natural Emotional Speech Corpus using Task Based Mood Induction
Paper (pdf)
Ch. Cullen, B. Vaughan, S. Kousidis, Y. Wang, C. McDonnell, D. Campbell (DIT)

1st International Conference on Multidisciplinary Information Sciences and Technologies (October 2006, Mérida, Spain)

Detecting emotional dimensions in speech is an area of great research interest, notably as a means of improving human-computer interaction in areas such as speech synthesis. In this paper, a method of obtaining high quality emotional audio speech assets is proposed. The methods of obtaining emotional content are subject to considerable debate, with distinctions between acted and natural speech being made on the grounds of authenticity. Mood Induction Procedures (MIPs) are often employed to stimulate emotional dimensions in a controlled environment. This paper details experimental procedures based around MIP 4, using performance-related tasks to engender activation and evaluation responses from the participant. Tasks are specified involving two participants, who must co-operate in order to complete a given task within the allotted time. Experiments designed in this manner also allow high quality audio assets (notably 24-bit/192 kHz) to be captured within an acoustically controlled environment, thus reducing unwanted acoustic factors within the recorded speech signal. Once suitable assets are obtained, they will be assessed for the purposes of segregation into differing emotional dimensions. The most statistically robust method of evaluation involves the use of listening tests to determine the perceived emotional dimensions within an audio clip. In this experiment, the FeelTrace rating tool is employed within user listening tests to specify the categories of emotional dimensions for each audio clip.

The Use of Task Based Mood-Induction Procedures to Generate High Quality Emotional Assets.
Poster (pdf)
B. Vaughan, Ch. Cullen, S. Kousidis, Y. Wang, C. McDonnell, D. Campbell (DIT)

IT&T - Information Technology and Telecommunications Conference (October 2006, Carlow, Ireland)

Detecting emotion in speech is important in advancing human-computer interaction, especially in the area of speech synthesis. This poster details experimental procedures based on Mood Induction Procedure 4, using performance-related tasks to engender natural emotional responses in participants. These tasks are aided or hindered by the researcher to elicit the desired emotional response. The responses are then recorded and their emotional content graded to form the basis of an emotional speech corpus. This corpus will then be used to develop a rule set for basic emotional dimensions in speech.

Groovator - An Implementation of Real-Time Rhythm Transformations
Paper (pdf)
J. Janer, J. Bonada, S. Jordà (UPF)

121st AES Convention (October 2006, San Francisco, USA)

This paper describes a real-time system for rhythm manipulation of polyphonic audio signals. A rhythm analysis module extracts tempo and beat location information. Based on this rhythm information, we apply different transformations: Tempo, Swing, Meter and Accent. This type of manipulation is generally referred to as content-based transformation. We address characteristics of the analysis and transformation algorithms. In addition, user interaction also plays an important role in this system: tempo variations can be controlled either by tapping the rhythm on a MIDI interface or by using an external audio signal, such as percussion or the voice, as tempo control. We conclude by pointing out several use cases, focusing on live performance situations.

Esophageal Voice Enhancement by Modeling Radiated Pulses in Frequency Domain
Paper (pdf)
A. Loscos, J. Bonada (UPF)

121st AES Convention (October 2006, San Francisco, USA)

Although esophageal speech has proven to be the most popular voice recovery method after laryngectomy surgery, it is difficult to master and shows a poor degree of intelligibility. This article proposes a new method for esophageal voice enhancement using speech digital signal processing techniques based on modeling radiated voice pulses in the frequency domain. The analysis-transformation-synthesis technique creates a non-pathological spectrum for those utterances detected as voiced and filters those that are unvoiced. Healthy spectrum generation implies transforming the original timbre, modeling harmonic phase coupling from the spectral shape envelope, and deriving pitch from frame energy analysis. The resynthesized speech aims to improve intelligibility, minimize artificial artifacts, and resemble the patient's pre-surgery voice.

A Corpus with Teeth
Presentation (pdf)
D. Campbell, M. Meinardi, B. Richardson, C. McDonnell (DIT)

EUROCALL Conference (September 2006, Granada, Spain)
ReCALL Journal (Vol 19, No. 1, January 2007, University of Hull, United Kingdom)

This paper outlines the ongoing construction of a speech corpus for use by applied linguists and advanced EFL/ESL students.

The first section establishes the need for improvements in the teaching of listening skills and pronunciation practice for EFL/ESL students. It argues for the need to use authentic native-to-native speech in the teaching/learning process so as to promote social inclusion and contextualises this within the literature, based mainly on the work of Swan, Brown and McCarthy.

The second part addresses features of native speech flow which cause difficulties for EFL/ESL students (Brown, Cauldwell) and establishes the need for improvements in the teaching of listening skills. Examples are given of reduced forms characteristic of relaxed native speech, and how these can be made accessible for study using the Dublin Institute of Technology’s slow-down technology, which gives students more time to study native speech features, without tonal distortion.

The final section introduces a novel Speech Corpus being developed at DIT. It shows the limits of traditional corpora and outlines the general requirements of a Speech Corpus. This tool - which will satisfy the needs of teachers, learners and researchers - will link digitally recorded, natural, native-to-native speech so that each transcript segment will be linked to its associated sound file. Users will be able to locate desired speech strings, play, compare and contrast them - and slow them down for more detailed study.

A Pitch Marks Filtering Algorithm based on Restricted Dynamic Programming
Paper (pdf)
F. Alías, C. Monzo, J.C. Socoró (URL)

InterSpeech 2006 - International Conference on Spoken Language Processing (ICSLP) (September 2006, Pittsburgh, USA)

In this paper, a generic pitch marks filtering algorithm (PMFA) is introduced in order to obtain reliable and smooth pitch marks from any input pitch tracking or marking algorithm. The proposed PMFA is a simple yet effective filtering process based on restricted dynamic programming, and very helpful for minimizing human intervention when creating large speech corpora. Moreover, this work introduces a novel pitch marking evaluation measure for directly comparing pitch marking algorithms with different location criteria. The experiments demonstrate that the proposed PMFA dramatically improves the results of state-of-the-art input pitch tracking and marking algorithms.
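
The dynamic programming idea can be illustrated with a toy version of the filtering step: given several candidate marks per pitch period, choose the sequence whose successive intervals stay closest to an expected period. This is a hypothetical simplification (a fixed expected period and an absolute-deviation transition cost), not the PMFA's actual restricted cost function.

```python
# Toy Viterbi-style smoothing of pitch marks: pick one candidate mark
# per period so that successive intervals deviate least from an
# expected period. Candidate times and the cost model are invented.
def smooth_pitch_marks(candidates, expected_period):
    """candidates: one list of candidate mark times (seconds) per pitch
    period. Returns the chosen mark sequence minimizing the summed
    deviation of successive intervals from expected_period."""
    prev_cost = {t: 0.0 for t in candidates[0]}
    backpointers = []
    for frame in candidates[1:]:
        cost, choice = {}, {}
        for t in frame:
            # best predecessor under the period-deviation transition cost
            p = min(prev_cost,
                    key=lambda q: prev_cost[q] + abs(t - q - expected_period))
            cost[t] = prev_cost[p] + abs(t - p - expected_period)
            choice[t] = p
        backpointers.append(choice)
        prev_cost = cost
    # backtrack from the cheapest final mark
    marks = [min(prev_cost, key=prev_cost.get)]
    for choice in reversed(backpointers):
        marks.append(choice[marks[-1]])
    return list(reversed(marks))

candidates = [[0.000, 0.002], [0.010, 0.013], [0.020, 0.025]]
marks = smooth_pitch_marks(candidates, expected_period=0.010)
```

The spurious candidates (0.002, 0.013, 0.025 s) are rejected because any path through them makes some interval deviate from the 10 ms expected period.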

Current Perspectives on Music Technologies & Multimedia
Presentation (pdf)
G. Holmberg (UPF)

ENGAGE 2006 (September 2006, Jakarta, Indonesia)

In the near future, when the analogue radio & TV network is switched off, we will most probably have in our homes some kind of digital Home Entertainment Platform/Media Center. And even more than today, we will carry with us portable media players & storage devices. A true digital revolution will radically alter our behavior with multimedia objects, such as music & audio. We will have constant access to the Internet, with all music & media of all times and origins available. On the one hand, this will require new and advanced methods of search & retrieval; this is the field of MIR (Music Information Retrieval) and Audio Content Analysis. On the other hand, we have the field of Audio Transformation & Synthesis: you will no longer be restricted to downloading & passively pressing "play". You will be able to interact with media objects: play a song in a different key, or slower/faster; suppress vocals and sing along; remix & play around with music, broadcast yourself and easily create new, personalized "versions" of the media object. We believe that the boundary between professional audio & media creation technology and home entertainment is about to dissolve, into an explosion of breath-taking technological developments & human creative power.

Transcripción fonética de acrónimos en castellano utilizando el algoritmo C4.5 (Phonetic Transcription of Spanish Acronyms by using C4.5 algorithm)
Paper (pdf, Spanish language)
C. Monzo, F. Alías, J.A. Morán, X. Gonzalvo (URL)

XXII Congreso de la SEPLN (September 2006, Zaragoza, Spain)

This work presents an automatic acronym transcription system intended to increase the quality of synthetic speech in text-to-speech systems when acronyms appear in the input text. Acronym transcription is performed using a decision tree (the C4.5 algorithm). The work presents the results obtained for different algorithm configurations, validating its performance with respect to other learning systems.

Letting the Corpus Speak
Presentation (pdf)
D. Campbell (DIT)

IVACS - Inter Varietal Corpus Studies (June 2006, Limerick, Ireland)

This presentation outlines the current state of development of DIT’s nascent speech corpus. This will allow a body of spoken material to be searched for features of informal native speech via a normalised transcription. Once located, the original sound files can be played at normal speed or slowed down in order to better study the speech act itself. Natural language specialists such as Richard Cauldwell have frequently lamented that this aspect of language learning has been neglected for decades.

Let the Corpus Speak!
Presentation (pdf)
D. Campbell (DIT)

40th IATEFL Annual Conference and Exhibition (April 2006, Harrogate, United Kingdom)

This presentation contrasts existing corpora with the novel Speech Corpus being developed at DIT. It points out the limits of existing corpora, both written and spoken, and outlines the general requirements of a Speech Corpus. This tool, which will satisfy the needs of teachers, learners and researchers, will link digitally recorded, natural, native-native speech acts (in WAV format) with their idealised, orthographic transcriptions. The transcriptions can be fed through a concordancer, with each transcript segment linked to its associated sound file. The segments will also be tagged for speed of delivery, which will allow users to locate the desired speech strings, play them, compare and contrast them, and, if necessary, slow them down for more detailed study.

SALERO: Semantic Audiovisual Entertainment Reusable Objects
Abstract (pdf), Poster (pdf)
W. Haas, G. Thallinger (JRS)

2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (November 2005, London, United Kingdom)

Ever since the idea of convergence was floated, the media industry has been talking about cross-platform exploitation as a way of producing more exciting content more cost-effectively. But while technology has helped to produce better quality sounds and images, the costs continue to rise. It is virtually impossible to re-use items from previous productions (regardless of issues of copyright) in different contexts, as the majority of sounds and images only work in the context and media type for which they were originally made.