Where am I?

Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye1,2*, Honglin Lin2*, Leyan Ou1, Dairong Chen4,1, Zihao Wang1, Conghui He2,3, Weijia Li1†

1Sun Yat-Sen University, 2Shanghai AI Laboratory, 3Sensetime Research, 4Wuhan University

Towards Text-guided Geo-localization. In scenarios where GPS signals are disrupted, users must describe their surroundings in natural language, providing location cues that can be used to determine their position (top). To address this, we introduce a text-based cross-view geo-localization task that retrieves satellite imagery or OSM data from textual queries to localize the described position (bottom).

Abstract

Cross-view geo-localization identifies the location of a street-view image by matching it with geo-tagged satellite images or OSM data. However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications such as pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons.

Task Application

Text-guided geo-localization is common in daily scenarios. For instance, taxi drivers rely on passengers' verbal instructions to identify a pick-up location, lost tourists describe their position when seeking help from a service center, and delivery drivers and couriers are guided to specific locations through text descriptions. In situations where GPS or Wi-Fi signals are disrupted, describing a location in natural language plays a crucial role.

Figure: example scenarios (passenger, courier, visitor).

Datasets

We introduce CVG-Text, a multimodal cross-view retrieval localization dataset designed to evaluate text-based scene localization. CVG-Text covers three cities (New York, Brisbane, and Tokyo) and contains over 30,000 scenes. The New York and Tokyo data are oriented toward urban environments, while the Brisbane data leans toward suburban scenes. Each sample includes the corresponding street-view image, OSM data, satellite image, and scene text description.
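
To make the sample structure concrete, below is a minimal sketch of how a single CVG-Text record could be represented. The field names, file paths, and values are illustrative assumptions, not the released schema.

```python
# Hypothetical layout of one CVG-Text sample; all field names and values are assumptions.
sample = {
    "city": "Tokyo",                              # one of: New York, Brisbane, Tokyo
    "street_view": "images/street/000123.jpg",    # ground-level image
    "satellite": "images/satellite/000123.jpg",   # geo-tagged satellite tile
    "osm": "images/osm/000123.png",               # rendered OSM tile for the same area
    "text": "A narrow street lined with vending machines; a convenience store sign is visible.",
    "latitude": 35.6895,                          # ground-truth location (illustrative values)
    "longitude": 139.6917,
}
```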


The figure above illustrates the geographic distribution of sample data across the three cities.

In our fine-grained text synthesis, we incorporate street-view image input, OCR, and open-world segmentation to enhance GPT's ability to capture scene information and to reduce hallucinations. We also carefully design system prompts that guide GPT to generate fine-grained textual descriptions through a progressive scene-analysis chain of thought, as sketched below.
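
As a rough illustration of this pipeline, the sketch below assembles OCR tokens and open-world segmentation labels into a progressive, step-by-step system prompt for an LMM. The helper names (`run_ocr`, `run_open_world_segmentation`, `call_lmm`) and the prompt wording are hypothetical placeholders, not the actual CVG-Text prompts.

```python
# A minimal sketch of the fine-grained text-synthesis step; the OCR engine,
# open-world segmenter, and LMM call are stubbed out with hypothetical names.

def run_ocr(image_path: str) -> list[str]:
    """Placeholder for an OCR engine returning readable signage text."""
    return []

def run_open_world_segmentation(image_path: str) -> list[str]:
    """Placeholder for an open-world segmenter returning scene-element labels."""
    return []

def call_lmm(image_path: str, system_prompt: str) -> str:
    """Placeholder for a call to a large multimodal model (e.g., GPT-4o)."""
    raise NotImplementedError

def build_system_prompt(ocr_tokens: list[str], segment_labels: list[str]) -> str:
    """Guide the LMM through a progressive scene analysis: layout, landmarks,
    readable text, then a localization-oriented description."""
    return (
        "You are describing a street-view scene to help locate it later.\n"
        "Step 1: describe the road layout, intersections, and building types.\n"
        "Step 2: describe distinctive objects and landmarks.\n"
        f"Step 3: mention readable signage only if visible in the image: {', '.join(ocr_tokens)}.\n"
        f"Step 4: ground the description in these detected elements: {', '.join(segment_labels)}.\n"
        "Do not guess street or city names that are not readable in the image."
    )

def generate_scene_text(image_path: str) -> str:
    ocr_tokens = run_ocr(image_path)
    segment_labels = run_open_world_segmentation(image_path)
    return call_lmm(image_path, build_system_prompt(ocr_tokens, segment_labels))
```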

The figures below illustrate examples from the CVG-Text dataset across the three cities. The generated texts effectively capture the localization cues present in the street-view images.

Figure: dataset examples from New York, Brisbane, and Tokyo.

Method

In this work, we introduce a novel task for cross-view geo-localization with natural language. The objective is to use a natural language description to retrieve the corresponding OSM or satellite image, whose location information is usually available. To address the challenges of this task, we propose CrossText2Loc. Our model adopts a dual-stream architecture consisting of a visual encoder and a text encoder. Specifically, it incorporates an Expanded Positional Embedding (EPE) module and a contrastive learning loss to support long-text contrastive learning. Additionally, we introduce an Explainable Retrieval Module (ERM), which combines attention heatmap generation with LMM interpretation to provide natural language explanations, enhancing the interpretability of the retrieval process.
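
The sketch below illustrates, under assumed tensor shapes, the two training ingredients named above: expanding a pretrained text encoder's positional embedding table so that long descriptions fit in the context window, and a CLIP-style symmetric contrastive loss between text and satellite/OSM embeddings. It is a simplified stand-in, not the exact EPE formulation used in CrossText2Loc.

```python
import torch
import torch.nn.functional as F

def expand_positional_embedding(pos_embed: torch.Tensor, new_len: int) -> torch.Tensor:
    """Interpolate a (old_len, dim) positional embedding table to (new_len, dim)
    so that longer text sequences can be encoded (simplified EPE stand-in)."""
    pe = pos_embed.unsqueeze(0).permute(0, 2, 1)                 # (1, dim, old_len)
    pe = F.interpolate(pe, size=new_len, mode="linear", align_corners=False)
    return pe.permute(0, 2, 1).squeeze(0)                        # (new_len, dim)

def contrastive_loss(text_emb: torch.Tensor, img_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine-similarity logits (CLIP-style),
    pairing the i-th text with the i-th satellite/OSM image in the batch."""
    text_emb = F.normalize(text_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = text_emb @ img_emb.t() / temperature                # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```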

Experimental Results

Quantitative evaluation

We evaluate the performance of various text-based retrieval methods under different settings of satellite imagery and OSM data, with the results shown in the table. Methods with the same architecture tend to perform better as the number of parameters increases. Among existing approaches, BLIP achieves the best performance, as it is not constrained by limits on text embedding length. Our method achieves the best overall results, outperforming the CLIP baseline by 14.1% in image recall (R@1) and 14.8% in localization recall (L@50), demonstrating its superiority on this task.
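
For reference, the sketch below shows one plausible way to compute the two headline metrics, assuming R@1 counts queries whose ground-truth reference is ranked first and L@50 counts queries whose top-ranked reference lies within 50 m of the ground-truth location; the paper's exact metric definitions may differ.

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6_371_000.0
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def evaluate(sim, ref_coords, gt_coords, radius_m=50.0):
    """sim: (num_queries, num_refs) similarity matrix, where reference i is
    assumed to be the ground-truth match for query i;
    ref_coords: (num_refs, 2) and gt_coords: (num_queries, 2) lat/lon arrays."""
    top1 = sim.argmax(axis=1)                                    # best reference per query
    r_at_1 = float(np.mean(top1 == np.arange(sim.shape[0])))     # image recall R@1
    d = haversine_m(gt_coords[:, 0], gt_coords[:, 1],
                    ref_coords[top1, 0], ref_coords[top1, 1])
    l_at_radius = float(np.mean(d <= radius_m))                  # localization recall L@50
    return r_at_1, l_at_radius
```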

Qualitative evaluation

The visualization results of OSM retrieval localization in the figure show that our method can locate specific stores or similar road details from effective textual descriptions. In addition, the heatmap responses produced by the Explainable Retrieval Module (ERM) show which features of the retrieved image the model focuses on. In the first example, the model focuses on “Burger Mania”; interestingly, even in the non-top-1 results, it still emphasizes the burger icon. In the second example, the model focuses on the “zebra crossing” and “white gridline”: the first three retrieval results all highlight the zebra crossing, and the best retrieval result matches both features. The ERM also leverages the capabilities of LMMs to provide corresponding retrieval reasoning, such as noting matched store information or scene features, which significantly enhances the interpretability of the retrieval localization.
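
As a speculative sketch of the heatmap half of ERM, the snippet below scores each visual patch token against the pooled text embedding and reshapes the scores into a spatial map that can be overlaid on the retrieved image. The actual ERM implementation is not specified here; this is only one plausible way to obtain such a heatmap.

```python
import torch
import torch.nn.functional as F

def text_patch_heatmap(patch_tokens: torch.Tensor, text_emb: torch.Tensor,
                       grid_hw: tuple[int, int]) -> torch.Tensor:
    """patch_tokens: (num_patches, dim) tokens from the visual encoder;
    text_emb: (dim,) pooled text embedding; grid_hw: the patch-grid shape.
    Returns a normalized (H, W) map of text-patch similarity."""
    patch_tokens = F.normalize(patch_tokens, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = patch_tokens @ text_emb                             # (num_patches,)
    heat = scores.reshape(grid_hw)                               # (H, W) patch grid
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    return heat                                                  # upsample and overlay for display
```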

Conclusion

In this work, we explore the task of cross-view geo-localization with natural language descriptions and introduce the CVG-Text dataset, which contains well-aligned street-view images, satellite images, OSM data, and text descriptions. We also propose CrossText2Loc, a text retrieval localization method that excels at long-text retrieval and interpretability for this task. This work represents a further advance in natural language-based localization and introduces new application scenarios for cross-view localization, encouraging future researchers to explore and innovate further.