
Workshop on Advanced Signal Processing

November 23-25, 2021

This is an open-access event

YouTube Live Streaming

WASP 2022

Workshop on Advanced Signal Processing, Machine Learning and Modeling

The workshop will comprise three days of activities, with plenary talks and the participation of national and international speakers.

The event is completely free and does not require prior registration.

The event is planned in a hybrid format, with speakers participating in person and via videoconference. It will be streamed live.

Topics

Signal Processing

Time-frequency analysis
Synchrosqueezing
Wave-Shape Functions
Signal Decomposition

Machine Learning

Medical Imaging
Image Restoration
Regularization
Generative Neural Networks

Modeling

Voice Models
F0 contour modeling
Vocal Function
Acoustic Modeling

Speakers

Matías Zañartu

Universidad Técnica Federico Santa María - Chile

Model-based estimation of physiological parameters for the clinical assessment of vocal function

In this talk, we summarize different efforts in our group to estimate physiological parameters that are clinically relevant for the assessment of vocal function, but difficult, if not impossible, to directly measure. Two approaches are discussed. First, a Bayesian framework based on a subject-specific, lumped-element vocal fold model is utilized to estimate subglottal pressure, laryngeal muscle activation, and vocal fold contact pressure. The framework is based on an extended Kalman filter and a voice production model featuring a body-cover model of the vocal folds and rules linking model parameters with intrinsic laryngeal muscle activation. The observation data is taken from calibrated transnasal high-speed videoendoscopy and oral airflow data, which are processed to compute glottal area and glottal airflow in physical units. The second approach provides the same physiological estimates but in the context of ambulatory voice monitoring, using a machine learning framework trained with a voice production model. A triangular body-cover model of the vocal folds with coordinated activation of five intrinsic laryngeal muscles is now used, and the observation signal is a neck-surface accelerometer. This signal is first processed using subglottal impedance-based inverse filtering to yield an estimate of the unsteady glottal airflow. Seven aerodynamic and acoustic features are extracted from the neck-surface accelerometer signal. A neural network architecture is selected to provide a mapping between the seven input features and subglottal pressure, vocal fold contact pressure, and cricothyroid and thyroarytenoid muscle activation. The experimental data used for validation of both approaches includes laryngeal high-speed videoendoscopy, aerodynamic, acoustic, and neck surface acceleration recordings, intramuscular electromyography, high-density surface electromyography, kinematic and aerodynamic recordings using silicone vocal fold models, and separate modeling approaches, such as finite element models. This unique dataset can be used to comprehensively assess the relevance of the physiological estimates obtained from the proposed methods.
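As a rough illustration of the first approach, the sketch below shows one generic extended Kalman filter predict/update cycle of the kind such a Bayesian framework builds on. The state-transition and observation functions f and h stand in here for a lumped-element vocal-fold model and its mapping to observed glottal area and airflow; these functions, the numerical Jacobians and all dimensions are placeholders for illustration, not the actual model used by the speaker's group.

```python
import numpy as np

def ekf_step(x, P, y, f, h, Q, R, eps=1e-6):
    """One predict/update cycle of an extended Kalman filter.

    x, P : current state estimate and covariance
    y    : new observation (e.g., glottal area and airflow in physical units)
    f, h : nonlinear state-transition and observation functions (placeholders
           for a lumped-element vocal-fold model and its output mapping)
    Q, R : process and observation noise covariances
    """
    def jacobian(func, v):
        # Forward-difference numerical Jacobian of func at v.
        f0 = np.atleast_1d(func(v))
        J = np.zeros((f0.size, v.size))
        for i in range(v.size):
            dv = np.zeros_like(v)
            dv[i] = eps
            J[:, i] = (np.atleast_1d(func(v + dv)) - f0) / eps
        return J

    # Predict step.
    F = jacobian(f, x)
    x_pred = np.atleast_1d(f(x))
    P_pred = F @ P @ F.T + Q

    # Update step with the new observation.
    H = jacobian(h, x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (np.atleast_1d(y) - np.atleast_1d(h(x_pred)))
    P_new = (np.eye(x.size) - K @ H) @ P_pred
    return x_new, P_new
```

In practice the state would collect the quantities of interest (e.g., subglottal pressure and muscle activations), and the filter would be run over the calibrated video and airflow observations frame by frame.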

Sylvain Meignen

Laboratoire Jean Kuntzmann - France

Synchrosqueezing transforms: From low to high frequency modulation, interference, noise and perspectives.

The general aim of this talk is to introduce the concept of synchrosqueezing transforms (SSTs) that was developed to sharpen linear time–frequency representations (TFRs), like the short-time Fourier or the continuous wavelet transforms, in such a way that the sharpened transforms remain invertible. This property is of paramount importance when one seeks to recover the modes of a multicomponent signal (MCS), corresponding to the superimposition of AM/FM modes, a model often used in many practical situations. After having recalled the basic principles of SST and explained why, when applied to an MCS, it works well only when the modes making up the signal are slightly modulated, we focus on how to circumvent this limitation. We then give illustrations in practical situations either associated with gravitational wave signals or modes with fast oscillating frequencies and discuss how SST can be used in conjunction with a demodulation operator, extending existing results in that matter.

In the second part of the talk, we will study in more detail the reassignment operators used in the different synchrosqueezing transforms in the case of interfering modes or in the presence of noise, in order to find ways of improving the instantaneous frequency and chirp rate estimation in such circumstances.

Finally, we will close the talk by outlining some new perspectives in this research field.
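For reference, one common way of writing the STFT-based synchrosqueezing transform (conventions and normalizations vary across papers, so this is only a generic reminder of the construction, not the exact formulation used in the talk): for a window g, the short-time Fourier transform and its local instantaneous-frequency estimate are

V_f(t,\eta) = \int f(\tau)\, g(\tau - t)\, e^{-2\pi i \eta (\tau - t)}\, d\tau,
\qquad
\hat{\omega}_f(t,\eta) = \Re\left\{ \frac{\partial_t V_f(t,\eta)}{2\pi i\, V_f(t,\eta)} \right\},

and the synchrosqueezed transform reassigns the STFT coefficients along the frequency axis,

T_f(t,\omega) = \int_{\{\eta\,:\,|V_f(t,\eta)| > \epsilon\}} V_f(t,\eta)\, \delta\big(\omega - \hat{\omega}_f(t,\eta)\big)\, d\eta,

so that each mode can be recovered by integrating T_f over a narrow band around its ridge, f_k(t) \approx \frac{1}{g(0)} \int_{|\omega - \hat{\phi}_k'(t)| < d} T_f(t,\omega)\, d\omega. The operator \hat{\omega}_f (and its chirp-rate refinements) is precisely what degrades under strong modulation, interference and noise, which motivates the second part of the talk.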

Mahadeva Prasanna

Indian Institute of Technology Dharwad - India

Significance of Excitation Source Information for Pathological Speech Processing

The excitation source plays an important role in pathological speech processing. It can be used for the analysis and processing of pathological speech and for the extraction of useful information. This talk will present some of the work done in the recent past using excitation source information for pathological speech processing.

Leandro Di Persia

Universidad Nacional del Litoral - Argentina

Machine learning for source separation, dereverberation, enhancement and source localization

One of my research lines over the last 15 years has been speech and audio processing. In this talk, I will present a brief review of machine learning approaches that have been applied to different problems in this area. It will range from traditional methods such as Markovian models (HMM-HMT) applied to denoising, through representation approaches such as ICA, NMF and dictionary learning applied to source separation, dereverberation and enhancement, to the more recent use of deep learning (LSTMs and U-Nets) applied to denoising and virtual source generation/localization.
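As a small, self-contained example of the representation-based family mentioned above, the sketch below applies NMF to a magnitude spectrogram and reconstructs components with Wiener-like soft masks. It is a generic, unsupervised single-channel recipe under simple assumptions, not one of the specific methods presented in the talk; the function and parameter names are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def nmf_separate(x, fs, n_components=2, n_fft=1024):
    """Toy single-channel separation: NMF on the magnitude spectrogram,
    then Wiener-like soft masks applied to the mixture (mixture phase kept)."""
    _, _, X = stft(x, fs=fs, nperseg=n_fft)
    mag, phase = np.abs(X), np.angle(X)

    model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    W = model.fit_transform(mag)      # spectral templates, shape (freq, comp)
    H = model.components_             # activations over time, shape (comp, frames)

    approx = W @ H + 1e-12
    sources = []
    for k in range(n_components):
        mask = np.outer(W[:, k], H[k]) / approx           # soft mask in [0, 1]
        _, s_k = istft(mask * mag * np.exp(1j * phase), fs=fs, nperseg=n_fft)
        sources.append(s_k)
    return sources
```

Keeping the mixture phase and masking only the magnitude is the usual simplification; supervised variants learn the spectral templates W from clean training material instead of from the mixture itself.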

Víctor M. Espinoza

Universidad de Chile - Chile

Objective assessment of vocal hyperfunction using acoustic modeling and audio processing

This talk will review the fundamental aspects of my research line on the objective assessment of vocal hyperfunction. The emphasis will be on how acoustic modeling tools and vibroacoustic biometric voice signals, together with their signal processing, have contributed in recent years to detecting, at early stages of development, vocal pathologies that can become chronic and disabling for the people affected. Results of studies carried out with this approach will also be presented, along with the current challenges and future research directions in the area.

Thomas Oberlin

Université de Toulouse - France

Regularization of image restoration problems with generative neural networks

In this talk, I will present a new way of regularizing an inverse problem in imaging (e.g., deblurring or inpainting) by means of a generative neural network. Compared to end-to-end models, such approaches seem particularly interesting since the same network can be used for many different problems and experimental conditions, as long as the generative model is suited to the data. Previous works proposed to use a synthesis framework, where the estimation is performed on the latent vector, the solution being obtained afterwards via the decoder. Instead, I will present an analysis formulation where we directly optimize in the image space. I will illustrate the interest of such a formulation by showing inpainting, deblurring and super-resolution experiments with different neural architectures. In many cases our technique achieves a clear performance improvement and seems to be more robust, in particular with respect to initialization.

Website: https://personnel.isae-supaero.fr/thomas-oberlin/

References:
Bora, A., Jalal, A., Price, E., & Dimakis, A. G. Compressed sensing using generative models. Proc. ICML 2017.

Oberlin, T., & Verm, M. Regularization via deep generative models: an analysis point of view. To appear in ICIP 2021. https://arxiv.org/abs/2101.08661
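A minimal sketch of how such an analysis formulation can be set up, assuming a pretrained generator, a known forward operator and measurements y (all of which are placeholder names, with an objective chosen only for illustration rather than taken from the paper): the image x is optimized directly, and the generator enters only through a penalty on the distance between x and its range, approximated here with a jointly optimized latent code z.

```python
import torch

def restore_analysis(y, forward_op, generator, x_shape, z_dim,
                     lam=0.1, n_iters=500, lr=1e-2):
    """Analysis-style restoration sketch: optimize the image x directly and
    use a pretrained generator only through a range-distance penalty.
    forward_op, generator and all shapes are illustrative placeholders."""
    x = torch.zeros(x_shape, requires_grad=True)
    z = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.Adam([x, z], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        data_fit = torch.sum((forward_op(x) - y) ** 2)      # fidelity to measurements
        range_pen = torch.sum((x - generator(z)) ** 2)      # distance to the range of G
        (data_fit + lam * range_pen).backward()
        opt.step()
    return x.detach()
```

Unlike the synthesis formulation, which optimizes over z alone and returns G(z), the solution here is not constrained to lie exactly on the generator's range, which may partly explain the robustness to initialization mentioned above.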

Jeremias Sulam

Johns Hopkins University - Maryland, USA

Data-driven methods for inverse problems in neuroimaging

Abstract to be confirmed.

Humberto Torres

INIGEM CONICET - UBA - Argentina

F0 contour modelling

Intonation is one of the most important prosody attributes of natural speech, carrying linguistic, paralinguistic and non-linguistic information. Fundamental frequency (F0) contour and pauses are the two most important physical correlates of intonation. A meticulous description of the F0 contour is useful both for understanding its relationships to the underlying information and for its use in applications such as high-quality speech synthesis.
I will introduce some of our work on modeling the F0 contour for Argentine Spanish; in particular, the work related to a model developed by Professor Hiroya Fujisaki, from my first contact with this model to the work we are doing nowadays and what we are planning for the near future. Fujisaki's model of intonation has been successfully tested for different languages; it stands out for its simplicity and strong physiological basis. The model parameterizes F0 contours efficiently: with a small number of parameters we can achieve a desired level of fitting accuracy. It is not my intention to discuss the philosophy behind this model; rather, I would like to highlight its physical and physiological foundations.
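For readers unfamiliar with it, the Fujisaki model (in its standard form; the notation below is the usual one, not a specific variant from this work) represents the log-F0 contour as a baseline value plus the superposition of phrase and accent components, each produced by a critically damped second-order linear filter:

\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{p_i}\, G_p(t - T_{0i}) + \sum_{j=1}^{J} A_{a_j}\, \big[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \big],

with, for t \ge 0,

G_p(t) = \alpha^2\, t\, e^{-\alpha t}, \qquad G_a(t) = \min\big[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\big],

and G_p(t) = G_a(t) = 0 for t < 0. Here F_b is the baseline frequency, A_{p_i} and T_{0i} are the magnitudes and onset times of the phrase commands, A_{a_j}, T_{1j} and T_{2j} are the amplitudes, onsets and offsets of the accent commands, and \alpha, \beta and \gamma are constants that are usually kept fixed; this is where the small number of parameters mentioned above comes from.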

Marcelo A. Colominas

IBB - CONICET - UNER - Argentina

Signal decomposition, segmentation and denoising with time-varying wave-shape functions

Modern time series are usually composed of multiple oscillatory components with time-varying frequency and amplitude, contaminated by noise. The signal processing task is further challenged if each component has an oscillatory pattern, or wave-shape function, far from sinusoidal, and if that pattern itself changes over time. In practice, when multiple components exist, it is desirable to robustly decompose the signal into its components for various purposes and to extract the desired dynamics information. Such challenges have raised a significant amount of interest in the past decade, but a satisfactory solution is still lacking. We will present a novel nonlinear regression scheme to robustly decompose a signal into its constituent oscillatory components with time-varying frequency, amplitude and wave-shape function. We have coined the algorithm shape-adaptive mode decomposition (SAMD).
Then, I will present an algorithm to estimate multiple wave-shape functions (WSFs) from a nonstationary oscillatory signal with time-varying amplitude and frequency. Suppose there are finitely many distinct periodic functions, acting as WSFs that model different oscillatory patterns in the signal, and that the active WSF may suddenly jump from one to another. The proposed algorithm detects the change points and estimates the WSFs from the signal by a novel iterative warping and clustering procedure, which combines time-frequency analysis, singular value decomposition entropy and vector spectral clustering.
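As a rough sketch of the underlying signal model (in one common notation; the precise assumptions differ slightly between the papers), the observed signal is written as

f(t) = \sum_{k=1}^{K} A_k(t)\, s_k\big(\phi_k(t)\big) + \Phi(t),

where each wave-shape function s_k is 1-periodic but not necessarily sinusoidal, the amplitude A_k(t) > 0 and the instantaneous frequency \phi_k'(t) > 0 vary slowly, and \Phi(t) is noise. SAMD-type decompositions approximate each component by a finite number of harmonics of its estimated phase,

A_k(t)\, s_k\big(\phi_k(t)\big) \approx \sum_{l=1}^{D_k} \Big[ a_{k,l}(t)\, \cos\big(2\pi l\, \phi_k(t)\big) + b_{k,l}(t)\, \sin\big(2\pi l\, \phi_k(t)\big) \Big],

and fit the coefficients by (nonlinear) least squares; the second algorithm described above additionally lets the active wave-shape switch among finitely many candidates over time, which is what the change-point detection handles.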

Juliana Codino

Lakeshore Professional Voice Center - Michigan, USA

Challenges in the clinical setting in voice signal processing

The world is witnessing exponential growth in research on the human voice. In the last two decades, laryngology has grown into a discipline in its own right, becoming independent from otolaryngology. The evolution of this subspecialty highlights the need for patient-centered evaluation tools, for deeper basic science research, and for evidence-based voice therapy techniques.

Acoustic and aerodynamic measurements, signal processing, the use of mathematical models, high-speed imaging and artificial intelligence are all methods currently applied in different areas and by different specialties, with the aim of achieving a better understanding of vocal function.

In the field of Speech-Language Pathology specialized in voice disorders, the study of voice signals is paramount for adequate treatment and follow-up.

This talk proposes to discuss some of the challenges faced in the clinical setting in this emerging field. We will examine certain limitations of evaluation tools that are in standard use in a voice center, with the hope of inspiring researchers to continue exploring the clinical usefulness of voice signal processing in the care of the dysphonic patient and of bridging the gap between basic science and clinical application.

Gabriel Alzamendi

IBB - CONICET - UNER - Argentina

Physiological modeling with applications in inverse problems of phonation

Physiological modeling has proved a valuable tool for gaining insights into the mechanisms underlying both normal and impaired human phonation. Nowadays, diverse computational models are available to describe and simulate the phenomena and the physical interactions involved in voice production. Two alternatives are especially appealing because they combine physiological relevance and low computational burden: lumped-element biomechanical modeling of vocal function and digital-waveguide acoustic-tube simulation of the vocal tract. They provide the basis for developing physiologically-based voice synthesizers, thus allowing the investigation of the acoustic effects due to alterations in glottal function. In recent years, physiological modeling has also drawn the attention of researchers for formulating and solving different inverse problems in the context of phonation. Innovative ideas have arisen from computational methods and models of vocal behavior for estimating relevant clinical information describing the pathophysiology of vocal organs from measured clinical data. The work in our research group focuses on combining physiological modeling and signal processing techniques for the clinical assessment of human phonation. In this talk, we will introduce different ideas explored in our group to improve the clinical relevance of low-dimensional models and well-established computational methods for voice assessment. We will discuss a state-space structural representation and analysis scheme for the perturbed pitch period contours in voice signals, new model-based techniques for glottal inverse filtering and acoustic source estimation, a Bayesian estimator of vocal function measures based on the extended Kalman filter and a biomechanical body-cover model of the vocal folds, and a muscle-controlled physiological voice synthesizer considering the effect of all five intrinsic laryngeal muscles.
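As a toy illustration of the digital-waveguide acoustic-tube idea mentioned above, the sketch below computes the impulse response of a concatenation of uniform tube sections with Kelly-Lochbaum scattering junctions. It is deliberately simplified (volume-velocity waves, one sample of delay per section, lossless sections, frequency-independent terminations with illustrative reflection coefficients) and is not the synthesizer described in the talk.

```python
import numpy as np

def tube_impulse_response(areas, n_samples, r_glottis=0.9, r_lips=-0.9):
    """Impulse response of a concatenated-tube vocal tract (Kelly-Lochbaum
    lattice) in a deliberately simplified form.

    areas : cross-sectional areas of the tube sections, glottis to lips.
    """
    areas = np.asarray(areas, dtype=float)
    n = len(areas)
    # Reflection coefficients at the junctions between adjacent sections.
    k = (areas[1:] - areas[:-1]) / (areas[1:] + areas[:-1])

    fwd = np.zeros(n)   # right-going (towards the lips) waves, one per section
    bwd = np.zeros(n)   # left-going (towards the glottis) waves
    out = np.zeros(n_samples)

    for t in range(n_samples):
        source = 1.0 if t == 0 else 0.0   # unit impulse injected at the glottis
        new_fwd = np.empty(n)
        new_bwd = np.empty(n)
        # Glottal end: source plus partial reflection of the returning wave.
        new_fwd[0] = source + r_glottis * bwd[0]
        # Scattering at every junction between section i and section i + 1.
        for i in range(n - 1):
            new_fwd[i + 1] = (1.0 + k[i]) * fwd[i] + k[i] * bwd[i + 1]
            new_bwd[i] = -k[i] * fwd[i] + (1.0 - k[i]) * bwd[i + 1]
        # Lip end: partial reflection; the radiated part is the output.
        new_bwd[-1] = r_lips * fwd[-1]
        out[t] = (1.0 - r_lips) * fwd[-1]
        fwd, bwd = new_fwd, new_bwd
    return out
```

Driving the lattice with a glottal flow waveform instead of an impulse, and letting the area function vary over time, is the usual step towards a muscle-controlled, articulatory-style synthesizer.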

Hugo L. Rufiner

SINC (FICH-UNL) - CONICET - UNER - Argentina

Artificial Systems for Cognitive Speech Processing

Speech is one of the most natural forms of communication between human beings. Speech signals convey much information about the speaker beyond words, such as emotional and health status, identity, age, gender and even height, to name a few. Human speech communication involves the interaction and coordination of various body parts that produce many signals that can be properly measured and analyzed. Examples of these signals related to the speech production or perception process are muscle signals (EMG), brain activities (EEG, ECoG, fNIRS, MEG, etc.), the electroglottogram (EGG) and video recordings. The complex processing that takes place at the brain level plays a crucial role in both the production and perception of speech signals. If the cognitive processes involved are better understood, it will be possible to develop more effective artificial methods of communication. Brain signals are particularly interesting because they are also used for brain-computer interfaces (BCIs). In this context, new BCI paradigms related to speech communication have been proposed, such as imagined or inner speech. For example, during imagined speech, subjects have to imagine themselves uttering words without moving any muscles or producing sounds. Some research has been done on the classification of vowels, syllables and whole words using the brain signals acquired during these speech-related paradigms, with promising results for this task, which can be called "Speak what you thought". Another interesting line of research is the reconstruction of recognizable speech from the direct recording of listeners' brain signals, thus demonstrating the feasibility of developing systems that can "Hear what you hear". The significant advances of the last decade in computational intelligence, such as the development of several new deep learning architectures together with the availability of vast amounts of experimental data, are responsible for reaching new levels of performance on such challenging tasks. This talk proposes a comprehensive and up-to-date overview of this new emerging field and discusses some of the challenges that need to be addressed to further improve the performance of this technology.
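As a very small example of the "Speak what you thought" direction, a common baseline (not any specific system from the talk; the input shapes, band choices and classifier are illustrative assumptions) classifies imagined-speech EEG epochs from log band-power features:

```python
import numpy as np
from scipy.signal import welch
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def bandpower_features(epochs, fs, bands=((4, 8), (8, 13), (13, 30), (30, 45))):
    """Log band-power features per EEG channel and frequency band.

    epochs : array of shape (n_trials, n_channels, n_samples) holding EEG
             segments recorded during an imagined-speech paradigm
             (illustrative input format).
    """
    freqs, psd = welch(epochs, fs=fs, nperseg=int(fs), axis=-1)
    feats = []
    for lo, hi in bands:
        idx = (freqs >= lo) & (freqs < hi)
        feats.append(np.log(psd[..., idx].mean(axis=-1)))
    # One feature vector per trial: channels x bands, flattened.
    return np.concatenate(feats, axis=-1).reshape(len(epochs), -1)

# A standard baseline classifier on top of the features.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# Hypothetical usage, assuming recorded epochs and their word labels:
# X = bandpower_features(epochs, fs=256)
# clf.fit(X, labels)
```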

Preliminary Program

Day 1 (November 23)
11:00 - 12:00  Prasanna
12:00 - 14:00  Lunch
14:00 - 15:00  Rufiner
15:00 - 16:00  Torres
16:00 - 16:30  Coffee Break
16:30 - 17:30  Codino

Day 2 (November 24)
11:00 - 12:00  Di Persia
12:00 - 14:00  Lunch
14:00 - 15:00  Zañartu
15:00 - 16:00  Alzamendi
16:00 - 16:30  Coffee Break
16:30 - 17:30  Espinoza

Day 3 (November 25)
11:00 - 12:00  Oberlin
12:00 - 14:00  Lunch
14:00 - 15:00  Sulam
15:00 - 16:00  Meignen
16:00 - 16:30  Coffee Break
16:30 - 17:30  Colominas