No, that's not how it works.
So you're saying it's not just feeding your video to the AI model and blindly trusting its outcome? Any evidence of how it works, then?
You can't just hold up a 2D object (a passport) and wave it about to try to trick it. There are heuristics at work.
For a regular camera, all objects are 2D; it is not equipped with tools to capture depth. What heuristics are you talking about? There is an ML model at work which tries to tell whether the object is legit, but it cannot have any real sense of what is in the image: it just relies on a statistically plausible outcome when fed pixels from your camera feed, which means you definitely can trick it.
You have to align your face in certain ways, a random video you found on the internet won't work.
If you don't match your face to the markers overlaid on the screen in a certain way, so it can gather measurements like your eye distance, nose position, etc., then it won't work. Impossible to do with a 2D object you're holding. So yeah, it does matter.
There's a literal industry that's popped up to make face identification from your smartphone a thing. You might want to research and catch up on the way they work.
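To make the eye-distance and nose point above concrete, here is a toy geometric sketch, not how any particular product does it. It assumes an orthographic camera, three landmarks (two eyes and a nose tip), and made-up numbers for eye spacing and nose depth; the function names `nose_offset_ratio` and `looks_flat` are hypothetical. The idea: when a head turns, a nose that sticks out shifts sideways relative to the eyes, while on a flat photo everything foreshortens together.

```python
import math

def nose_offset_ratio(eye_dist, nose_depth, yaw_deg):
    """Sideways nose offset from the eye midpoint, divided by the
    projected eye distance, for a head turned by yaw_deg.
    Orthographic projection is an assumption of this sketch."""
    yaw = math.radians(yaw_deg)
    # Eyes sit at x = -e/2 and +e/2 with depth 0; nose tip at x = 0, depth d.
    # Rotation about the vertical axis: x' = x*cos(yaw) + z*sin(yaw).
    left = -eye_dist / 2 * math.cos(yaw)
    right = eye_dist / 2 * math.cos(yaw)
    nose = nose_depth * math.sin(yaw)
    mid = (left + right) / 2
    return (nose - mid) / (right - left)

def looks_flat(ratios, tol=0.02):
    """True if the ratio barely changes across poses (tol is arbitrary)."""
    return max(ratios) - min(ratios) < tol

yaws = [-20, 0, 20]  # poses the on-screen markers might ask for (assumed)
photo = [nose_offset_ratio(63.0, 0.0, y) for y in yaws]   # flat: depth 0
face = [nose_offset_ratio(63.0, 20.0, y) for y in yaws]   # assumed 20 mm nose

print(looks_flat(photo))  # True
print(looks_flat(face))   # False
```

This only shows why turning a flat object differs from turning a head under these assumptions. It says nothing about how well a real system estimates landmarks from pixels, which is the part the other side of the argument is questioning.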