That isn't how quantum computers work.
My understanding, which is far from certain, is that problems like this (put a huge number of candidate solutions into superposition, coordinate across the superposed branches to identify the best one, then arrange the quantum output so that when it is classically sampled, the best answer is the most likely result) are solved in four stages (a toy code sketch follows the list):
1. 2^N potential solutions encoded in superposition across N qubits.
2. Each superposition goes through the quantum version of a normal inference pass, using that superposition's weights to iteratively process all the classical data, then computes a performance measure. (This is all done without any classical sampling.)
3. Cross-superposition communication that results in agreement on the desired version. This is the weakest part of my knowledge: I know that such an operation exists, but I don't know how circuits implement it. The number of bits being operated on is small, though, just a performance measure. (This is also done without any classical sampling.)
4. Then the output is sampled to get classical values. This requires N * log2(N) circuit complexity, where N is the number of bits defining the chosen solution, i.e. parameters and architectural settings. This can be a lot of hardware, obviously, perhaps more than the rest of the hardware, given N will be very large.
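For concreteness, here is a toy classical simulation of the four stages. It assumes the "agreement" step in stage 3 is Grover-style amplitude amplification, which is my assumption, not something the approach above necessarily uses. Everything here (the decode function, the scoring loss, the iteration count) is illustrative, and the oracle cheats by using a classically precomputed winner purely to drive the simulation; on real hardware the scoring and marking would happen coherently.

```python
# Toy numpy statevector simulation of the four stages above.
# Not real quantum hardware and not any real quantum framework.
import numpy as np

rng = np.random.default_rng(0)

n_qubits = 8                      # N qubits -> 2**N candidate solutions
n_candidates = 2 ** n_qubits

# Stage 1: uniform superposition over all candidate bit-strings.
state = np.full(n_candidates, 1.0 / np.sqrt(n_candidates), dtype=complex)

# Stage 2: "inference pass" per branch. Each candidate index is decoded into
# a tiny weight vector and scored on fixed classical data, standing in for
# running the model and measuring performance coherently.
X = rng.normal(size=(32, n_qubits))
y = rng.normal(size=32)

def decode(index):
    bits = np.array([(index >> k) & 1 for k in range(n_qubits)], dtype=float)
    return 2.0 * bits - 1.0        # map bits to +/-1 weights

def score(index):
    w = decode(index)
    return -np.mean((X @ w - y) ** 2)   # higher is better

scores = np.array([score(i) for i in range(n_candidates)])
best = int(np.argmax(scores))

# Stage 3: cross-branch agreement, modeled as Grover iterations that flip the
# phase of the best-scoring branch and then reflect about the mean amplitude.
def grover_iteration(state, marked):
    state = state.copy()
    state[marked] *= -1.0          # oracle: phase-flip the winning branch
    mean = state.mean()
    return 2.0 * mean - state      # diffusion: reflect about the mean

n_iters = int(np.floor(np.pi / 4 * np.sqrt(n_candidates)))
for _ in range(n_iters):
    state = grover_iteration(state, best)

# Stage 4: classical sampling. The winner should now dominate the distribution.
probs = np.abs(state) ** 2
sample = rng.choice(n_candidates, p=probs / probs.sum())
print("best:", best, "sampled:", sample, "P(best) =", round(float(probs[best]), 3))
```

Running this, the sampled index almost always matches the best-scoring candidate, which is the whole point of stage 4: the amplification makes the good answer the likely measurement outcome.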
Don't take anything I say here for granted. I have designed parts of such a circuit on paper, using idealized quantum gates, but not all of it. I am not an expert, but I believe that every step here is well understood by others.
The downside relative to other approaches: it does a complete search of the entire solution space directly, so the result comes from reducing 2^N superposed candidates across N qubits down to a single classical N-bit answer, where for interesting models N can be very large (billions of parameters x parameter bit width). No efficiency is gained from gradient information or any other heuristic. This is the brute-force method, so it gives an upper bound on the hardware requirements.
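To give a feel for the scale, here is a back-of-envelope calculation using hypothetical numbers (1 billion parameters at 8 bits each; both figures are placeholders, not claims about any particular model):

```python
import math

# Hypothetical model size: 1e9 parameters at 8 bits per parameter.
params = 1_000_000_000
bits_per_param = 8

n = params * bits_per_param          # N qubits to hold one candidate solution
log2_candidates = n                  # the superposition spans 2**N candidates
readout_gates = n * math.log2(n)     # the N * log2(N) sampling estimate above

print(f"qubits N               : {n:.2e}")
print(f"log2(candidate count)  : {log2_candidates:.2e}")
print(f"readout circuit (gates): {readout_gates:.2e}")
```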
Another disadvantage is that, since the solution was not obtained by following a gradient, the best solution might not be robust. Meaning, it might be a kind of sharp vertex between robust solutions that happens to do best on the design data, but it might not generalize well to inputs that differ slightly from the design data. This is less likely to be a problem if smoothing/regularization information is taken into account when determining each superposition's performance.
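A minimal sketch of what "taking smoothing into account" could mean at the scoring step: penalize candidates whose loss degrades sharply under small input perturbations. The perturbation scheme, the penalty weight, and the helper names here are all illustrative assumptions, not a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(1)

def robust_score(loss_fn, weights, X, y, noise=0.05, n_perturb=8, lam=1.0):
    """Score = -(loss at weights) - lam * (average loss increase under
    small random input perturbations). Higher is better."""
    base = loss_fn(weights, X, y)
    bumps = [loss_fn(weights, X + rng.normal(scale=noise, size=X.shape), y)
             for _ in range(n_perturb)]
    sharpness = max(0.0, float(np.mean(bumps)) - base)
    return -(base + lam * sharpness)

# Example loss for a linear candidate w on data (X, y).
def mse_loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))
```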