Towards Text-guided Geo-localization. In scenarios where GPS signals are disrupted, users must describe their surroundings in natural language, providing location cues that can determine their position (Top). To address this, we introduce a text-based cross-view geo-localization task that retrieves satellite imagery or OSM data from textual queries to localize the user (Bottom).
Abstract
Cross-view geo-localization identifies the location of a street-view image by matching it with geo-tagged satellite images or OSM data. However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications such as pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models (LMMs) to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons.
Task Application
Text-guided geo-localization arises frequently in everyday scenarios. For instance, taxi drivers rely on passengers' verbal directions to identify a pickup location, lost tourists describe their position when seeking help from service centers, and delivery drivers and couriers are guided to specific locations through text descriptions. When GPS or Wi-Fi signals are disrupted, describing a location in natural language plays a crucial role.
Scenario panels: Passenger, Courier, Visitor.
Datasets
We introduce CVG-Text, a multimodal cross-view retrieval localization
dataset designed to evaluate text-based scene localization tasks.
CVG-Text covers three cities: New York, Brisbane, and Tokyo,
encompassing over 30,000 scene data points. The data from New York and
Tokyo is more oriented toward urban environments, while the Brisbane
data leans towards suburban scenes. Each data point includes corresponding street-view images, OSM data, satellite images, and an associated scene text description.
The accompanying figure illustrates the geographic distribution of sample data across the three cities.
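As a concrete illustration, one plausible way to represent a CVG-Text sample is sketched below; the field names and file layout are assumptions for illustration rather than the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CVGTextSample:
    """Hypothetical record layout for one CVG-Text data point (field names are illustrative)."""
    sample_id: str          # unique identifier of the scene
    city: str               # "NewYork", "Brisbane", or "Tokyo"
    street_view_path: str   # path to the ground-level street-view image
    satellite_path: str     # path to the geo-tagged satellite tile
    osm_path: str           # path to the rendered OSM tile covering the same area
    text: str               # generated scene text description with localization cues
    lat: float              # ground-truth latitude of the street-view capture point
    lon: float              # ground-truth longitude of the street-view capture point
```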
In our fine-grained text synthesis, we incorporate street-view image
input, OCR, and open-world segmentation to enhance GPT's information
capture capability and reduce hallucination. Additionally, we carefully design system prompts that guide GPT to generate fine-grained textual descriptions following a progressive scene-analysis chain of thought.
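A minimal sketch of how such a synthesis prompt could be assembled is shown below. The prompt wording, the `call_lmm` placeholder, and the input fields are illustrative assumptions, not the exact prompts or pipeline used to build CVG-Text.

```python
def build_annotation_prompt(ocr_texts, segment_labels):
    """Assemble system/user prompts that combine OCR strings and open-world
    segmentation labels as grounding cues for the LMM (illustrative wording)."""
    system_prompt = (
        "You are annotating a street-view image for geo-localization. "
        "Analyze the scene step by step: overall layout, roads and crossings, "
        "buildings and storefronts, then signs and text. Describe only what is "
        "supported by the image and the cues below; do not invent details."
    )
    user_prompt = (
        f"OCR text detected in the image: {', '.join(ocr_texts) or 'none'}\n"
        f"Objects from open-world segmentation: {', '.join(segment_labels) or 'none'}\n"
        "Write a fine-grained description with cues useful for localization."
    )
    return system_prompt, user_prompt

# Example usage; call_lmm is a placeholder for the actual GPT call with the image attached.
system_prompt, user_prompt = build_annotation_prompt(
    ocr_texts=["Burger Mania", "5th Ave"],
    segment_labels=["crosswalk", "storefront", "traffic light"],
)
# description = call_lmm(image, system_prompt, user_prompt)
```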
The figures below illustrate examples of CVG-Text data from the three cities. The generated texts effectively capture the geo-localization information present in the street-view images.
Example panels: New York, Brisbane, Tokyo.
Method
In this work, we introduce a novel task for cross-view geo-localization with natural language. The objective is to retrieve, from a natural language description, the corresponding OSM or satellite image, whose location information is usually available. To address the challenges of this task, we propose the CrossText2Loc architecture. Our model adopts a dual-stream design consisting of a visual encoder and a text encoder. Specifically, the architecture incorporates an Expanded Positional Embedding Module (EPE) and a contrastive learning loss to facilitate long-text contrastive learning.
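The sketch below illustrates the general idea under two assumptions: that EPE stretches a pretrained text positional embedding to a longer sequence length via simple linear interpolation, and that the contrastive objective is a standard symmetric InfoNCE loss. This is a minimal sketch of the mechanism, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def expand_positional_embedding(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Stretch a pretrained positional embedding of shape (old_len, dim) to new_len
    by linear interpolation, so the text encoder can accept longer descriptions."""
    # (old_len, dim) -> (1, dim, old_len) for 1-D interpolation over positions
    stretched = F.interpolate(
        pos_emb.t().unsqueeze(0), size=new_len, mode="linear", align_corners=True
    )
    return stretched.squeeze(0).t()  # (new_len, dim)

def contrastive_loss(text_emb: torch.Tensor, img_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched text / reference-image (satellite or OSM) pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = text_emb @ img_emb.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```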
Additionally, we introduce a novel Explainable Retrieval Module (ERM), which combines attention heatmap generation with LMM interpretation to provide natural language explanations, enhancing the interpretability of the retrieval process.
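As a rough illustration of the heatmap part of ERM, the sketch below scores each visual patch token against the pooled query text embedding and upsamples the result to the retrieved image's resolution. This is a generic text-to-patch relevance map assumed for illustration, not necessarily the exact attention mechanism used in ERM.

```python
import torch
import torch.nn.functional as F

def similarity_heatmap(patch_feats: torch.Tensor, text_emb: torch.Tensor, image_size):
    """Relevance map: cosine similarity between the query text embedding and each
    visual patch token, reshaped and upsampled to the retrieved image resolution."""
    # patch_feats: (num_patches, dim) patch tokens from the visual encoder (CLS removed)
    # text_emb:    (dim,) pooled text embedding of the query description
    sims = F.normalize(patch_feats, dim=-1) @ F.normalize(text_emb, dim=-1)  # (num_patches,)
    side = int(sims.numel() ** 0.5)                      # assumes a square patch grid
    heat = sims.reshape(1, 1, side, side)
    heat = F.interpolate(heat, size=image_size, mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)  # normalize to [0, 1]
    return heat.squeeze()                                # (H, W) map ready for overlay
```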
Experimental Results
Quantitative evaluation
We evaluate the performance of various text-based retrieval methods under both the satellite-image and OSM settings, with results shown in the table. Methods sharing the same architecture tend to perform better as the number of parameters increases. Among existing approaches, BLIP achieves the best performance, as it is not constrained by limits on text embedding length. Our method achieves the best overall results, outperforming the CLIP baseline by 14.1% in Image Recall R@1 and 14.8% in Localization Recall L@50, demonstrating its superiority on this task.
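For reference, the sketch below shows one way these metrics can be computed, assuming Image Recall R@k checks whether a query's ground-truth reference appears among its top-k candidates and Localization Recall L@50 counts a query as correct when the top-1 retrieved reference lies within 50 m of the true position; the benchmark's exact protocol may differ.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Image Recall R@k: fraction of queries whose ground-truth reference
    (index i for query i) appears in the top-k retrieved candidates."""
    topk = np.argsort(-sim, axis=1)[:, :k]               # sim: (num_queries, num_refs)
    return float(np.mean([i in topk[i] for i in range(sim.shape[0])]))

def localization_recall(sim: np.ndarray, query_xy: np.ndarray,
                        ref_xy: np.ndarray, radius_m: float = 50.0) -> float:
    """Localization Recall L@radius: fraction of queries whose top-1 retrieved
    reference lies within `radius_m` meters of the query's true position."""
    top1 = np.argmax(sim, axis=1)
    dists = np.linalg.norm(ref_xy[top1] - query_xy, axis=1)  # metric (projected) coordinates
    return float(np.mean(dists <= radius_m))
```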
Qualitative evaluation
The visualization of OSM retrieval localization in the figure shows that our method can locate specific store information or similar road details from effective textual descriptions. Additionally, the heatmap responses provided by the Explainable Retrieval Module (ERM) reveal which features the model focuses on in the retrieved image. In the first example, the model focuses on “Burger Mania”; interestingly, even in non-top-1 results, it still emphasizes the burger icon. In the second example, the model focuses on the “zebra crossing” and “white gridline”, with the first three retrieval results all highlighting the zebra crossing and the best retrieval result matching both features. The ERM also leverages the capabilities of LMMs to provide corresponding retrieval reasoning, such as matching the same store information or scene features, which significantly enhances the interpretability of the retrieval localization.
Conclusion
In this work, we explore the task of cross-view geo-localization using natural language descriptions and introduce the CVG-Text dataset, which includes well-aligned street-view images, satellite images, OSM images, and text descriptions. We also propose CrossText2Loc, a text retrieval localization method that excels in long-text retrieval and interpretability for this task. This work represents another advancement in natural language-based localization and introduces new application scenarios for cross-view localization, encouraging further exploration and innovation by subsequent researchers.