Where am I?

Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye1,2*, Honglin Lin2*, Leyan Ou1, Dairong Chen4,1, Zihao Wang1, Conghui He2,3, Weijia Li1†

1Sun Yat-Sen University, 2Shanghai AI Laboratory, 3Sensetime Research, 4Wuhan University

Towards Text-guided Geo-localization. In scenarios where GPS signals are disrupted, users must describe their surroundings in natural language, providing location cues that can be used to determine their position (top). To address this, we introduce a text-based cross-view geo-localization task that retrieves satellite imagery or OSM data from textual queries to localize the described position (bottom).

Abstract

Cross-view geo-localization identifies the location of a street-view image by matching it with geo-tagged satellite images or OSM data. However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications such as pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons.

Task Application

Text-guided geo-localization is common in daily scenarios. For instance, taxi drivers rely on passengers' verbal instructions to identify a pick-up location, lost tourists describe their position when seeking help from a service center, and delivery drivers and couriers are guided to specific locations through text descriptions. In situations where GPS or Wi-Fi signals are disrupted, describing a location in natural language plays a crucial role.

Figure: example scenarios (passenger, courier, visitor).

Datasets

We introduce CVG-Text, a multimodal cross-view retrieval localization dataset designed to evaluate text-based scene localization. CVG-Text covers three cities (New York, Brisbane, and Tokyo) and contains over 30,000 scenes. The New York and Tokyo data are oriented toward urban environments, while the Brisbane data leans toward suburban scenes. Each sample includes the corresponding street-view image, OSM data, satellite image, and scene text description.
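
To make the sample structure concrete, below is a minimal sketch of how a single CVG-Text record could be represented. The field names, file paths, and values are illustrative assumptions, not the released schema.

```python
# Hypothetical layout of one CVG-Text sample; all field names and values are assumptions.
sample = {
    "city": "Tokyo",                              # one of: New York, Brisbane, Tokyo
    "street_view": "images/street/000123.jpg",    # ground-level image
    "satellite": "images/satellite/000123.jpg",   # geo-tagged satellite tile
    "osm": "images/osm/000123.png",               # rendered OSM tile for the same area
    "text": "A narrow street lined with vending machines; a convenience store sign is visible.",
    "latitude": 35.6895,                          # ground-truth location (illustrative values)
    "longitude": 139.6917,
}
```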


The figure above illustrates the geographic distribution of sample data across the three cities.

In our fine-grained text synthesis, we incorporate street-view image input, OCR, and open-world segmentation to enhance GPT's ability to capture scene information and to reduce hallucinations. We also carefully design system prompts that guide GPT to generate fine-grained textual descriptions through a progressive scene-analysis chain of thought, as sketched below.
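
As a rough illustration of this pipeline, the sketch below assembles OCR tokens and open-world segmentation labels into a progressive, step-by-step system prompt for an LMM. The helper names (`run_ocr`, `run_open_world_segmentation`, `call_lmm`) and the prompt wording are hypothetical placeholders, not the actual CVG-Text prompts.

```python
# A minimal sketch of the fine-grained text-synthesis step; the OCR engine,
# open-world segmenter, and LMM call are stubbed out with hypothetical names.

def run_ocr(image_path: str) -> list[str]:
    """Placeholder for an OCR engine returning readable signage text."""
    return []

def run_open_world_segmentation(image_path: str) -> list[str]:
    """Placeholder for an open-world segmenter returning scene-element labels."""
    return []

def call_lmm(image_path: str, system_prompt: str) -> str:
    """Placeholder for a call to a large multimodal model (e.g., GPT-4o)."""
    raise NotImplementedError

def build_system_prompt(ocr_tokens: list[str], segment_labels: list[str]) -> str:
    """Guide the LMM through a progressive scene analysis: layout, landmarks,
    readable text, then a localization-oriented description."""
    return (
        "You are describing a street-view scene to help locate it later.\n"
        "Step 1: describe the road layout, intersections, and building types.\n"
        "Step 2: describe distinctive objects and landmarks.\n"
        f"Step 3: mention readable signage only if visible in the image: {', '.join(ocr_tokens)}.\n"
        f"Step 4: ground the description in these detected elements: {', '.join(segment_labels)}.\n"
        "Do not guess street or city names that are not readable in the image."
    )

def generate_scene_text(image_path: str) -> str:
    ocr_tokens = run_ocr(image_path)
    segment_labels = run_open_world_segmentation(image_path)
    return call_lmm(image_path, build_system_prompt(ocr_tokens, segment_labels))
```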

The figures below illustrate examples from the CVG-Text dataset across the three cities. The generated texts effectively capture the localization cues present in the street-view images.

Figure: dataset examples from New York, Brisbane, and Tokyo.

Method

In this work, we introduce a novel task for cross-view geo-localization with natural language. The objective is to use a natural language description to retrieve the corresponding OSM or satellite image, whose location information is usually available. To address the challenges of this task, we propose CrossText2Loc. Our model adopts a dual-stream architecture consisting of a visual encoder and a text encoder. Specifically, it incorporates an Expanded Positional Embedding (EPE) module and a contrastive learning loss to support long-text contrastive learning. Additionally, we introduce an Explainable Retrieval Module (ERM), which combines attention heatmap generation with LMM interpretation to provide natural language explanations, enhancing the interpretability of the retrieval process.
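
The sketch below illustrates, under assumed tensor shapes, the two training ingredients named above: expanding a pretrained text encoder's positional embedding table so that long descriptions fit in the context window, and a CLIP-style symmetric contrastive loss between text and satellite/OSM embeddings. It is a simplified stand-in, not the exact EPE formulation used in CrossText2Loc.

```python
import torch
import torch.nn.functional as F

def expand_positional_embedding(pos_embed: torch.Tensor, new_len: int) -> torch.Tensor:
    """Interpolate a (old_len, dim) positional embedding table to (new_len, dim)
    so that longer text sequences can be encoded (simplified EPE stand-in)."""
    pe = pos_embed.unsqueeze(0).permute(0, 2, 1)                 # (1, dim, old_len)
    pe = F.interpolate(pe, size=new_len, mode="linear", align_corners=False)
    return pe.permute(0, 2, 1).squeeze(0)                        # (new_len, dim)

def contrastive_loss(text_emb: torch.Tensor, img_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine-similarity logits (CLIP-style),
    pairing the i-th text with the i-th satellite/OSM image in the batch."""
    text_emb = F.normalize(text_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = text_emb @ img_emb.t() / temperature                # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```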

Experimental Results

Quantitative evaluation

We evaluate the performance of various text-based retrieval methods under different settings of satellite imagery and OSM data, with the results shown in the table. Methods with the same architecture tend to perform better as the number of parameters increases. Among existing approaches, BLIP achieves the best performance, as it is not constrained by limits on text embedding length. Our method achieves the best overall results, outperforming the CLIP baseline by 14.1% in image recall (R@1) and 14.8% in localization recall (L@50), demonstrating its superiority on this task.
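
For reference, the sketch below shows one plausible way to compute the two headline metrics, assuming R@1 counts queries whose ground-truth reference is ranked first and L@50 counts queries whose top-ranked reference lies within 50 m of the ground-truth location; the paper's exact metric definitions may differ.

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6_371_000.0
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def evaluate(sim, ref_coords, gt_coords, radius_m=50.0):
    """sim: (num_queries, num_refs) similarity matrix, where reference i is
    assumed to be the ground-truth match for query i;
    ref_coords: (num_refs, 2) and gt_coords: (num_queries, 2) lat/lon arrays."""
    top1 = sim.argmax(axis=1)                                    # best reference per query
    r_at_1 = float(np.mean(top1 == np.arange(sim.shape[0])))     # image recall R@1
    d = haversine_m(gt_coords[:, 0], gt_coords[:, 1],
                    ref_coords[top1, 0], ref_coords[top1, 1])
    l_at_radius = float(np.mean(d <= radius_m))                  # localization recall L@50
    return r_at_1, l_at_radius
```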

Qualitative evaluation

The visualization results of OSM retrieval localization in the figure show that our method can locate specific stores or similar road details from effective textual descriptions. In addition, the heatmap responses produced by the Explainable Retrieval Module (ERM) show which features of the retrieved image the model focuses on. In the first example, the model focuses on “Burger Mania”; interestingly, even in the non-top-1 results, it still emphasizes the burger icon. In the second example, the model focuses on the “zebra crossing” and “white gridline”: the first three retrieval results all highlight the zebra crossing, and the best retrieval result matches both features. The ERM also leverages the capabilities of LMMs to provide corresponding retrieval reasoning, such as noting matched store information or scene features, which significantly enhances the interpretability of the retrieval localization.
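
As a speculative sketch of the heatmap half of ERM, the snippet below scores each visual patch token against the pooled text embedding and reshapes the scores into a spatial map that can be overlaid on the retrieved image. The actual ERM implementation is not specified here; this is only one plausible way to obtain such a heatmap.

```python
import torch
import torch.nn.functional as F

def text_patch_heatmap(patch_tokens: torch.Tensor, text_emb: torch.Tensor,
                       grid_hw: tuple[int, int]) -> torch.Tensor:
    """patch_tokens: (num_patches, dim) tokens from the visual encoder;
    text_emb: (dim,) pooled text embedding; grid_hw: the patch-grid shape.
    Returns a normalized (H, W) map of text-patch similarity."""
    patch_tokens = F.normalize(patch_tokens, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = patch_tokens @ text_emb                             # (num_patches,)
    heat = scores.reshape(grid_hw)                               # (H, W) patch grid
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    return heat                                                  # upsample and overlay for display
```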

Conclusion

In this work, we explore the task of cross-view geo-localization with natural language descriptions and introduce the CVG-Text dataset, which contains well-aligned street-view images, satellite images, OSM data, and text descriptions. We also propose CrossText2Loc, a text retrieval localization method that excels at long-text retrieval and interpretability for this task. This work represents a further advance in natural language-based localization and introduces new application scenarios for cross-view localization, encouraging future researchers to explore and innovate further.