Interpretable Latent Spaces Using Space-Filling Vector Quantization | by Mohammad Hassan Vali | Apr, 2024


A new unsupervised method that combines two concepts, vector quantization and space-filling curves, to interpret the latent space of DNNs

This post is a short explanation of our novel unsupervised distribution modeling technique called space-filling vector quantization [1], published at the Interspeech 2023 conference. For more details, please take a look at the paper under this link.


Deep generative models are well-known neural network-based architectures that learn a latent space whose samples can be mapped to sensible real-world data such as images, video, and speech. Such latent spaces act as a black box and are often difficult to interpret. In this post, we introduce our novel unsupervised distribution modeling technique that combines two concepts, space-filling curves and vector quantization (VQ), and is called Space-Filling Vector Quantization (SFVQ). SFVQ helps make the latent space interpretable by capturing its underlying morphological structure. It is important to note that SFVQ is a generic tool for modeling distributions, and its use is not restricted to any specific neural network architecture or data type (e.g. image, video, speech, etc.). In this post, we demonstrate the application of SFVQ to interpret the latent space of a voice conversion model. To understand this post you do not need any technical knowledge of speech signals, because we explain everything in general terms. First, let me explain what the SFVQ technique is and how it works.

Space-Filling Vector Quantization (SFVQ)

Vector quantization (VQ) is a data compression technique, similar to the k-means algorithm, which can model any data distribution. The figure below shows VQ applied to a Gaussian distribution. VQ clusters this distribution (gray points) using 32 codebook vectors (blue points) or clusters. Each Voronoi cell (green lines) contains one codebook vector, such that this codebook vector is the closest codebook vector (in terms of Euclidean distance) to all data points located in that Voronoi cell. In other words, each codebook vector is the representative vector of all data points located in its corresponding Voronoi cell. Therefore, applying VQ to this Gaussian distribution means mapping each data point to its closest codebook vector, i.e. representing each data point by its closest codebook vector. For more information about VQ and its other variants, you can check out this post.
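As a minimal sketch of this mapping step (using NumPy, with a randomly initialized codebook rather than a learned one, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 points from a 2D Gaussian and 32 codebook vectors.
# In practice the codebook is learned, e.g. with a k-means-style update.
data = rng.normal(size=(1000, 2))
codebook = rng.normal(size=(32, 2))

# Pairwise Euclidean distances between every data point and every codebook vector.
dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=-1)

# Quantization: each point is represented by its closest codebook vector.
indices = dists.argmin(axis=1)   # shape (1000,)
quantized = codebook[indices]    # shape (1000, 2)
```

Each point's Voronoi cell membership is exactly the `argmin` over distances, so `quantized` is the compressed version of `data`.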

Vector quantization applied to a Gaussian distribution using 32 codebook vectors. (image by author)

A space-filling curve is a piecewise continuous line generated by a recursive rule; if the recursion iterations are repeated infinitely, the curve bends until it completely fills a multi-dimensional space. The following figure illustrates the Hilbert curve [2], a well-known type of space-filling curve in which the corner points are defined by a specific mathematical formulation at each recursion iteration.

First five iterations of the Hilbert curve filling a 2D square distribution. (image by author)
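To make the "corner points from a recursive rule" idea concrete, here is the classic bit-manipulation routine that maps a distance along the Hilbert curve to grid coordinates (this is the standard textbook construction, not the specific formulation used in our paper):

```python
def hilbert_d2xy(order, d):
    """Map a distance d along the Hilbert curve to (x, y)
    on a 2**order x 2**order grid (classic d2xy algorithm)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:          # rotate the quadrant so sub-curves connect
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Corner points of the order-3 curve (an 8x8 grid), in traversal order.
corners = [hilbert_d2xy(3, d) for d in range(64)]
```

A useful sanity check: consecutive corner points are always grid neighbors, which is exactly the locality property SFVQ borrows from space-filling curves.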

Taking intuition from space-filling curves, we can think of vector quantization (VQ) as mapping input data points onto a space-filling curve (rather than mapping data points only onto codebook vectors, as in normal VQ). We therefore incorporate vector quantization into space-filling curves, such that our proposed space-filling vector quantizer (SFVQ) models a D-dimensional data distribution by a continuous piecewise linear curve whose corner points are vector quantization codebook vectors. The following figure illustrates VQ and SFVQ applied to a Gaussian distribution.

Codebook vectors (blue points) of a vector quantizer, and a space-filling vector quantizer (curve in black) on a Gaussian distribution (gray points). Voronoi regions for VQ are shown in green. (image by author)

For technical details on how to train SFVQ and how to map data points onto SFVQ's curve, please see section 2 in our paper [1].
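The paper covers the actual training and mapping procedure; purely to illustrate what "mapping a point onto the curve" means geometrically, here is a sketch that projects points onto the nearest line segment of a polyline whose corners play the role of SFVQ codebook vectors (an assumed toy version, not the paper's algorithm):

```python
import numpy as np

def project_to_polyline(points, corners):
    """Project each point onto its nearest segment of a piecewise linear curve.

    corners: (K, D) corner points of the curve (codebook vectors).
    Returns projected points and a continuous curve coordinate in [0, K-1].
    """
    a = corners[:-1]                 # segment starts, shape (K-1, D)
    b = corners[1:]                  # segment ends,   shape (K-1, D)
    ab = b - a
    # Fraction along each segment of the orthogonal projection, clipped to [0, 1].
    t = ((points[:, None, :] - a) * ab).sum(-1) / (ab * ab).sum(-1)
    t = np.clip(t, 0.0, 1.0)
    proj = a + t[..., None] * ab     # (N, K-1, D) candidate projections
    d2 = ((points[:, None, :] - proj) ** 2).sum(-1)
    seg = d2.argmin(1)               # nearest segment per point
    n = np.arange(len(points))
    return proj[n, seg], seg + t[n, seg]
```

The continuous curve coordinate is what makes SFVQ's representation ordered: nearby coordinates correspond to nearby regions of the distribution.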

Note that when we train a normal VQ on a distribution, adjacent codebook vectors inside the learned codebook matrix can refer to completely different contents. For example, the first codebook element might refer to a vowel phone and the second to a silent part of the speech signal. However, when we train SFVQ on a distribution, the learned codebook vectors end up arranged such that adjacent elements in the codebook matrix (i.e. adjacent codebook indices) refer to similar contents in the distribution. We can use this property of SFVQ to interpret and explore the latent spaces of Deep Neural Networks (DNNs). As a typical example, in the following we explain how we used our SFVQ method to interpret the latent space of a voice conversion model [3].

Voice Conversion

The following figure shows a voice conversion model [3] based on the vector quantized variational autoencoder (VQ-VAE) [4] architecture. In this model, the encoder takes the speech signal of speaker A as input and passes its output into the vector quantization (VQ) block to extract the phonetic information (phones) from the speech signal. This phonetic information, together with the identity of speaker B, then goes into the decoder, which outputs the converted speech signal. The converted speech contains the phonetic information (content) of speaker A with the identity of speaker B.

Voice conversion model based on the VQ-VAE architecture. (image by author)

In this model, the VQ module acts as an information bottleneck that learns a discrete representation of speech, capturing only the phonetic content and discarding the speaker-related information. In other words, the VQ codebook vectors are expected to capture only the phone-related content of the speech. Here, the representation at the VQ output is considered the latent space of this model. Our objective is to replace the VQ module with our proposed SFVQ method in order to interpret this latent space. By interpretation, we mean determining which phone each latent vector (codebook vector) corresponds to.
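In NumPy pseudocode, the bottleneck step looks roughly like the following (all shapes here are hypothetical, chosen only to make the data flow concrete; the real model uses trained networks, not random arrays):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: 100 encoder output frames of dimension 64,
# a codebook with 256 entries, and an 8-dim speaker embedding.
encoder_out = rng.normal(size=(100, 64))   # stand-in for encoder(speech_A)
codebook = rng.normal(size=(256, 64))
speaker_b = rng.normal(size=(8,))

# Bottleneck: replace each frame with its nearest codebook vector,
# discarding everything the codebook cannot represent (ideally, speaker identity).
dists = np.linalg.norm(encoder_out[:, None, :] - codebook[None, :, :], axis=-1)
phone_codes = dists.argmin(axis=1)         # discrete phonetic representation
quantized = codebook[phone_codes]          # (100, 64) latent vectors

# Decoder input: phonetic content of speaker A + identity of speaker B.
decoder_in = np.concatenate(
    [quantized, np.tile(speaker_b, (len(quantized), 1))], axis=1
)                                          # shape (100, 72)
```

The `quantized` frames are the latent space we want to interpret; SFVQ replaces the `argmin` lookup with a mapping onto its ordered curve.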

Interpreting the Latent Space Using SFVQ

We evaluate our space-filling vector quantizer (SFVQ) on its ability to find structure in the latent space (representing phonetic information) of the above voice conversion model. For our evaluations, we used the TIMIT dataset [5], since it contains phone-wise labeled data using the phone set from [6]. For our experiments, we use the following phonetic grouping:

  • Plosives (Stops): {p, b, t, d, k, g, jh, ch}
  • Fricatives: {f, v, th, dh, s, z, sh, zh, hh, hv}
  • Nasals: {m, em, n, nx, ng, eng, en}
  • Vowels: {iy, ih, ix, eh, ae, aa, ao, ah, ax, ax-h, uh, uw, ux}
  • Semi-vowels (Approximants): {l, el, r, er, axr, w, y}
  • Diphthongs: {ey, aw, ay, oy, ow}
  • Silence: {h#}.

To analyze the performance of our proposed SFVQ, we pass the labeled TIMIT speech files through the trained encoder and SFVQ modules, respectively, and extract the codebook vector indices corresponding to all phones present in the speech. In other words, we pass a speech signal with labeled phones and then compute the index of the learned SFVQ codebook vector that each of these phones is mapped to. As explained above, we expect our SFVQ to map similar phonetic contents next to each other (index-wise in the learned codebook matrix). To examine this expectation, in the following figure we visualize the spectrogram of the sentence "she had your dark suit", together with its corresponding codebook vector indices for the ordinary vector quantizer (VQ) and our proposed SFVQ.
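The bookkeeping behind this analysis is simple: for each labeled frame, record which codebook index it maps to, then tally indices per phone. A sketch with made-up frame labels and indices (the real values come from TIMIT alignments and the trained SFVQ):

```python
from collections import Counter, defaultdict

# Hypothetical frame-aligned phone labels and SFVQ indices for one utterance.
frame_phones = ["sh", "sh", "iy", "iy", "iy", "hh", "ae", "ae", "dcl", "d"]
frame_indices = [12, 13, 40, 41, 41, 30, 45, 46, 5, 8]

# Tally which SFVQ codebook indices each labeled phone gets mapped to.
phone_to_indices = defaultdict(Counter)
for phone, idx in zip(frame_phones, frame_indices):
    phone_to_indices[phone][idx] += 1
```

If SFVQ behaves as expected, the index counts for each phone concentrate on a narrow, contiguous range of codebook indices, while an ordinary VQ scatters them across the codebook.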
