IIUC, we need Gibbs sampling (to compute the weight updates) instead of the gradient-based forward and backward passes we're used to with today's neural networks. Anyone understand why that is so?
Thought I'd weigh in here as well: I believe Gibbs sampling is being used as a way to approximate the expectation over the model distribution. This value is required to compute the gradient of the log-likelihood, but integrating over the distribution is intractable.
This is similar to the way you might use MCMC to draw a representative sample from a VAE. In the deep learning formulation of a neural network, the gradient is estimated over batches of the dataset rather than over an explicitly modeled probability distribution.
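To make that concrete: for a binary RBM with weights w_ij, the standard log-likelihood gradient is

```latex
\frac{\partial \log p(v)}{\partial w_{ij}}
  = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}
```

The first expectation is easy (clamp the visible units to a data vector), but the second is over the full model distribution, and that's the term the Gibbs chain is there to approximate.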
Not an expert, but I have a bit of formal training on Bayesian stuff which handles similar problems.
Usually Gibbs is used when there's no straightforward gradient (or when you're interested in reproducing the distribution itself, rather than a point estimate), but you do have marginal/conditional likelihoods that are simple to sample from.
Since each visible node depends on every hidden node and each hidden node affects every visible node, the gradient ends up being very messy, so it's much simpler to use Gibbs sampling to adjust the weights based on those conditional likelihoods.
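To illustrate the "simple conditionals" point: in a binary RBM, p(h_j = 1 | v) and p(v_i = 1 | h) are each just a logistic function of the other layer, so block Gibbs sampling alternates between them. A minimal numpy sketch; the sizes, parameter names, and random initialization here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical RBM parameters: W couples visible and hidden units, b and c are biases.
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)   # visible bias
c = np.zeros(n_hidden)    # hidden bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    # p(h_j = 1 | v) is a simple logistic function -- easy to sample from
    p = sigmoid(v @ W + c)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h):
    # p(v_i = 1 | h) has the same simple form
    p = sigmoid(h @ W.T + b)
    return (rng.random(p.shape) < p).astype(float)

# Block Gibbs sampling: alternate between the two conditionals.
v = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(1000):
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)
# v now behaves (approximately) like a sample from the model's distribution
```

Each loop iteration is one Gibbs step; after enough steps, v is approximately a draw from the model's joint distribution, which is exactly what the model-expectation term of the gradient needs.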
I might be mistaken, but I think this is partly because of the undirected structure of RBMs, so you can't build a computational graph in the same way as with feed-forward networks.
By "undirected structure" I assume you mean the presence of cycles in the graph? I was taught to call such networks "recurrent", but that term seems to have evolved to mean something slightly different. Anyway, yes: because of the cycles, Gibbs sampling is key to the network's operation. One still employs gradient descent during training, but the procedure for calculating the gradient itself involves Gibbs sampling.
Edit: Actually, I was talking about the General Boltzmann Machine. For the Restricted Boltzmann Machine, an approximation is used that obviates the need for full Gibbs sampling during training. Then (quoting the article, emphasis mine) "after training, it can sample new data from the learned distribution using Gibbs sampling."
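If I remember right, the approximation usually meant here is Hinton's contrastive divergence (CD-1), which truncates the Gibbs chain after a single reconstruction step instead of running it to equilibrium. A rough sketch, with made-up sizes and a random vector standing in for a training example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical RBM parameters (sizes chosen for illustration).
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)   # visible bias
c = np.zeros(n_hidden)    # hidden bias
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 step: a single Gibbs half-cycle instead of a full chain."""
    global W, b, c
    # Positive phase: hidden activations given the data vector
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one reconstruction step of the Gibbs chain
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate gradient: data correlations minus reconstruction correlations
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)

v0 = rng.integers(0, 2, size=n_visible).astype(float)  # stand-in for a training example
cd1_update(v0)
```

The outer-product difference stands in for the data expectation minus the model expectation of v_i h_j; running the chain for k steps (CD-k) gives a better approximation at higher cost.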