MusicHiFi: Fast High-Fidelity Stereo Vocoding

Ge Zhu¹,²* Juan-Pablo Caceres² Zhiyao Duan¹ Nicholas J. Bryan²

¹University of Rochester, Rochester, NY
²Adobe Research
*Work done during an internship at Adobe Research

Paper Video

Abstract

Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi, an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsample to high-resolution audio via bandwidth expansion, and upmix to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near cycle-consistent bandwidth extension module, and 3) a new fast cycle-consistent mono-to-stereo module that ensures the preservation of monophonic content in the output. We evaluate our proposed approach using both objective and subjective listening tests and find that it achieves comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work.

Bibtex

          
          @article{zhu2024musichifi,
              title={MusicHiFi: Fast High-Fidelity Stereo Vocoding}, 
              author={Zhu, Ge and Caceres, Juan-Pablo and Duan, Zhiyao and Bryan, Nicholas J.},
              year={2024},
              archivePrefix={arXiv},
              primaryClass={cs.SD},
          }

Examples

We showcase sample outputs that highlight the capabilities of our high-fidelity, cascaded stereo vocoding system for music generation. Starting from mel-spectrograms, we first generate a waveform with a GAN-based vocoder and then enhance the generated music through GAN-based bandwidth extension and mono-to-stereo upmixing. Our demonstration includes both intermediate outputs from the different vocoding stages and the system's final output. The input mel-spectrograms are generated by a diffusion-based music generation system. For the mono-to-stereo conversion, the spectrograms shown represent the side channel of the stereo audio. All audio samples are provided in MP3 format.
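For readers unfamiliar with the term "side channel", the following is a minimal sketch of mid/side stereo coding. It assumes the common convention mid = (L + R) / 2 and side = (L - R) / 2; the function names and sample values are hypothetical and are not taken from the MusicHiFi implementation. Under this convention, whatever side signal is used, averaging the decoded left and right channels returns the mid (mono) signal, which illustrates how monophonic content can be preserved in a stereo output.

```python
# Mid/side sketch under the assumed convention:
#   mid = (L + R) / 2, side = (L - R) / 2
# All names below are illustrative, not part of any MusicHiFi code.

def to_mid_side(left, right):
    """Encode left/right sample lists into mid (mono) and side channels."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def to_left_right(mid, side):
    """Decode mid/side channels back into left/right channels."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

def downmix(left, right):
    """Average the two channels to obtain a mono signal."""
    return [(l + r) / 2 for l, r in zip(left, right)]

# Toy stereo signal (values chosen to be exact in floating point).
left = [0.5, -0.25, 0.0, 1.0]
right = [0.25, -0.25, 0.5, -1.0]

mid, side = to_mid_side(left, right)
rec_left, rec_right = to_left_right(mid, side)

# Swapping in a different side signal changes the stereo image,
# but the downmix of the result is still the original mid channel.
other_side = [0.5, 0.5, -0.125, 0.25]
new_left, new_right = to_left_right(mid, other_side)
mono_again = downmix(new_left, new_right)
```

Here `rec_left` and `rec_right` reproduce the toy input, and `mono_again` equals `mid` even though a different side signal was used.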

Vocoded from Generated Mel-spectrograms

Vocoding

Bandwidth Extension

Mono-to-stereo

Below are samples from out-of-distribution data, comparing our generated audio with the original ground truth. The mel-spectrograms used for synthesis are extracted from Creative Commons-licensed music in the FMA dataset. Detailed licensing information for each music piece can be found at this link. For the mono-to-stereo conversion, the spectrograms shown represent the side channel of the stereo audio. All audio samples are provided in MP3 format.