"In practice, we assume that the image contains the relevant information from the text, and do not explicitly condition the point clouds on the text," the research team points out. These diffusion ...