Coarse-to-Fine Visual Question Answering by Iterative, Conditional Refinement