Music Generation: GANs vs. LSTMs

DSC160 Data Science and the Arts - Final Project - Generative Arts - Spring 2020

Project Team Members:

Abstract


We aim to explore the generative prowess of GANs and LSTMs and to compare their results. We also try several implementations of each network to examine the subtle differences within each architecture. Most importantly, this exploration lets us investigate the rarely touched intersection of art and technology and contemplate some metaphysical questions: What is music, and can we define a quantitative measure of it? Is there a dissonance between our perception of music and an algorithm's? Is there a universal archetype of music composition on which we all agree?

We plan to employ different implementations of GANs and LSTMs, trained on Bach preludes and fugues, since Bach is the epitome of music and his work is fairly structured, which makes the training results more salient. Hopefully, the models can generate Bach-like music and capture the melody, polyphony, and polyrhythms of his composing style; better still, they may be creative in how they use Bach's composing techniques. Our results will be presented as audio and wave plots so the audience can see the patterns, along with a defined metric of musicality used to compare the networks. Likely challenges include deciding on a good quantitative metric and generating music with an underlying structure rather than random sounds. This project is interesting because we are curious whether algorithms can capture the patterns of music and generate creatively from the learned material, much as a human composes from learned building blocks like scales, arpeggios, and chord progressions.

Data and Model


Code


There is no code for data acquisition or scraping, as the training data consists of sound files downloaded directly from the link above.

Link to code: https://github.com/ucsd-dsc-arts/dsc160-final-mjisthegoat/blob/master/code/Final_Project.ipynb

This link contains the code for running the main generative models of this project: WaveGAN, SpecGAN, and the LSTM. The code for the preprocessing, training, and generation procedures is included sequentially.
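For reference, here is a minimal preprocessing sketch in the spirit of the notebook: loading a WAV file with soundfile (one of our dependencies) and slicing it into fixed-length training windows. The file path and window length are hypothetical placeholders, not the notebook's actual values.

```python
import numpy as np
import soundfile as sf

def load_windows(path, window=16384):
    """Load an audio file, collapse to mono, and slice into fixed-length windows."""
    audio, sr = sf.read(path)          # audio: float array, sr: sample rate
    if audio.ndim > 1:
        audio = audio.mean(axis=1)     # average channels down to mono
    n = len(audio) // window
    # drop the ragged tail so every window has the same length
    return audio[: n * window].reshape(n, window).astype(np.float32), sr

windows, sr = load_windows("data/bach_prelude_01.wav")  # hypothetical path
print(windows.shape, sr)
```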

Results


Link to results: https://github.com/ucsd-dsc-arts/dsc160-final-mjisthegoat/blob/master/results

Results:

  • The pred_rnn.wav file is the LSTM prediction for Bach; listening to it, you can clearly hear both the overfitting and the poor sound quality.

  • rnn predictions.png and original wave.png compare the wave plots of the LSTM prediction and test data at one specific point.

  • Rnnpred1.png and orig1.png provide an additional comparison of a wave from the original song versus the LSTM prediction.

  • SpecGan.wav is the music generated by SpecGAN, and GenerateBach.wav is the music generated by WaveGAN. You can hear that WaveGAN does a much better job of replicating the style of Bach's music, while the SpecGAN output is screechy and generally poor.

  • SpecGan.png and WaveGan.png show the wave plots of the music generated by the respective GANs (a sketch of how such plots can be reproduced follows this list).
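For readers who want to reproduce the wave-plot comparisons, here is a minimal sketch using soundfile and matplotlib. Only the file names come from the results folder; the figure layout is an assumption, not the exact notebook code.

```python
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf

fig, axes = plt.subplots(2, 1, sharex=True, figsize=(10, 5))
for ax, (path, title) in zip(
    axes,
    [("results/pred_rnn.wav", "LSTM prediction"),
     ("results/GenerateBach.wav", "WaveGAN output")],
):
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)     # mono, so each panel shows one trace
    t = np.arange(len(audio)) / sr     # time axis in seconds
    ax.plot(t, audio, linewidth=0.5)
    ax.set_title(title)
    ax.set_ylabel("amplitude")
axes[-1].set_xlabel("time (s)")
plt.tight_layout()
plt.savefig("waveplot_comparison.png")
```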

Discussion


In terms of pure performance, the LSTM captured the music of Bach much better than the GAN-trained model. However, this was mainly because the LSTM overfit the data and essentially reproduced music eerily similar to the original training data, making it fairly useless for actual musical generation. Even as we continually changed hyperparameters and lowered the number of epochs, this issue remained. It was intriguing that the model could fit so well and produce music that sounded almost the same as Bach, but recognizing that this offered no real benefit for generation was important in gauging its usefulness. Unfortunately, the project we adapted may not have had great applicability for accurate and ethical generation, but it was a good lesson that not all models are unbiased, and some can be engineered to look much better than they really are. If we were to extend this work, we would likely adapt or search for a model that required less hand-holding, so that it could actually produce unique music instead of manufacturing a tiny MSE by nearly replicating the training music with a few hiccups.
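To make the overfitting discussion concrete, below is an illustrative Keras sketch of a sample-level LSTM of the kind described above, with dropout and early stopping as standard levers against memorization. Layer sizes, window length, and callbacks are assumptions for illustration, not the exact model from our notebook.

```python
from tensorflow import keras

WINDOW = 1024  # number of past samples used to predict the next one (assumed)

model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, 1)),
    keras.layers.LSTM(128),
    keras.layers.Dropout(0.3),   # dropout: one lever against memorizing Bach
    keras.layers.Dense(1),       # regress the next audio sample
])
model.compile(optimizer="adam", loss="mse")

# Early stopping on a held-out split helps keep the model from simply
# replaying the training audio with a tiny MSE.
early_stop = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

# X: (n, WINDOW, 1) input windows, y: (n, 1) next samples
# model.fit(X, y, validation_split=0.1, epochs=20, callbacks=[early_stop])
```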

From an "originality" perspective, the GANs did a better job as composers, generating genuinely new music, whereas the LSTM generated nearly the same snippets as its training data. However, the GANs we implemented were limited in output duration: both networks could only generate 1-4 second clips, which makes it hard to form a holistic view of their capabilities. We also found that WaveGAN performed far better than SpecGAN, producing more impressive generative results with fewer training hours. Moreover, SpecGAN's audio resolution was low, and its rhythmic and melodic structure was nowhere near WaveGAN's. This is possibly because SpecGAN applies image-generating GANs to image-like audio spectrograms, and some information may be lost in the conversion from audio to image. To give a bigger picture of the generated audio, we concatenated the results, each of which has a duration of one second. Possible improvements include straightforward approaches such as more training data and more training time, or more research into state-of-the-art GAN music generation. The form of the training data could also be changed so the inputs are more easily "digested" by the network; one option is to use MIDI files rather than WAV or MP3 files. More sophisticated approaches requiring domain knowledge, such as chord analysis, transposition, or separating melodic and harmonic lines, are highly recommended. Music is a sophisticated domain in its own right, and a GAN that works extremely well on images may not transfer smoothly. More research is needed on GAN adaptations for music, but the future is promising.
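The concatenation step mentioned above can be as simple as the following sketch; the generated-clip file names are hypothetical.

```python
import numpy as np
import soundfile as sf

clips = []
sr = None
for path in ["gen_000.wav", "gen_001.wav", "gen_002.wav"]:  # hypothetical names
    audio, sr = sf.read(path)
    clips.append(audio)

# Plain end-to-end concatenation; crossfading between clips would smooth the
# seams but is omitted for brevity.
sf.write("concatenated.wav", np.concatenate(clips), sr)
```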

This project is culturally innovative in that it tries to quantify and produce music mathematically, something that in theory should not work, given that art and science are not a common intersection. In this case, however, we show that a classical composer like Bach can have his style mirrored by a neural network trained on his music, precisely because his style is so mathematical in nature. This makes for a very interesting cross-section of topics, because it shows just how applicable big data and AI are to the wider world. Growing up, we are often told to pick up an instrument to develop our right brain, and music is often thought to be for the more artistically inclined; yet a great deal of math goes into tempo, pitch, keys, and even rhythm. It is for this reason that we are able to create music that mirrors Bach's to at least some degree.

Our results essentially let us break down music into mathematical components. This raises concerns for music that is distinctive to particular cultures. Music often plays a key role in social and cultural identity and can be vital to how one expresses ties to a culture; if we can reproduce similar music through AI, we risk taking away something unique to certain cultures. Furthermore, because music is part of the tourism and entertainment industries, we could also cause economic consequences. In short, by quantifying the art of music into mathematical components, we risk losing the social and cultural ties through which music unifies and expresses identity, not only on a personal level but also on a cultural and economic one.

The ethical concern with what we are doing is that we are taking something supposedly confined to artistic genius, innovation, and imagination, and assigning numbers to it in order to automate it. This sets a dangerous precedent because it could take away from the accomplishment of the artist. Imagine that AI could fully replicate a style that took an artist years to create and master: their life's work could be boiled down to a math equation, and a song that sounds like it came from the artist could theoretically be produced in mere minutes or hours. This would destroy the foundation of music. Admittedly this is a near-apocalyptic conjecture for the music industry, but it is always important to be aware of ethical concerns before a technology becomes widespread. Artistic creation and appreciation remain an incredibly important aspect of life, and one that arguably should not be tampered with by AI.

In the future, we could apply our process to a wider range of data by including more genres that encompass a wider range of musical architecture. It would be interesting to see results on genres vastly different from classical Bach. We could do this by implementing workers, a distributed-training technique we learned in another DSC class, Systems for Scalable Analytics (sketched below). Doing so would cut down training time and expand the generative capabilities, and would also let the models train on significantly more data, which should in theory only improve them. Furthermore, we would like to analyze the process of capturing musical patterns and qualitatively assess which model performed better. As we said before, Bach's music is very precise and mathematical in nature, so in theory it would be much harder to capture the style of an artist from the hip-hop genre, given the looser structure of modern music.
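One way to realize the "workers" idea with our existing stack is TensorFlow's built-in distribution strategies. The sketch below uses MirroredStrategy to replicate training across the GPUs of a single machine; the model is a stand-in for illustration, not our actual network.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # replicate across local GPUs
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any Keras model built inside the scope is mirrored across devices, and
    # gradients are aggregated automatically during fit().
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1024, 1)),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset, epochs=10)  # input batches are sharded across replicas
```

For training across multiple machines, tf.distribute.MultiWorkerMirroredStrategy is the analogous option.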

Team Roles

Adam ran point on creating, analyzing, and discussing the LSTM model, while Yijian did the same for the GAN models. Ka Ming, Jonathan, and Joseph created the PowerPoint presentation. All group members contributed to the discussion.

Technical Notes and Dependencies

Implementation details and dependencies needed to reproduce our work:

  • Additional libraries: this code uses TensorFlow.
  • Required pip packages: Keras and soundfile.

