Merging Data¶
There are two ways to combine datasets in geopandas: attribute joins and spatial joins.
In an attribute join, a GeoSeries or GeoDataFrame is combined with a regular pandas.Series or pandas.DataFrame based on a common variable. This is analogous to normal merging or joining in pandas.
In a spatial join, observations from two GeoSeries or GeoDataFrame objects are combined based on their spatial relationship to one another.
In the following examples, we use these datasets:
# The session below assumes geopandas is already imported: import geopandas
In [1]: world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
In [2]: cities = geopandas.read_file(geopandas.datasets.get_path('naturalearth_cities'))
# For attribute join
In [3]: country_shapes = world[['geometry', 'iso_a3']]
In [4]: country_names = world[['name', 'iso_a3']]
# For spatial join
In [5]: countries = world[['geometry', 'name']]
In [6]: countries = countries.rename(columns={'name':'country'})
Appending¶
Appending GeoDataFrame and GeoSeries objects uses the pandas append() methods.
Keep in mind that appended geometry columns need to have the same CRS.
# Appending GeoSeries
In [7]: joined = world.geometry.append(cities.geometry)
# Appending GeoDataFrames
In [8]: europe = world[world.continent == 'Europe']
In [9]: asia = world[world.continent == 'Asia']
In [10]: eurasia = europe.append(asia)
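The append() methods used above were removed in pandas 2.0, so on newer installations the same result is usually obtained with pandas.concat(). The following is a minimal sketch rather than part of the original session: the two small GeoDataFrames and their column names are made up for illustration, and both share the same CRS as required above.

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

# Two small, made-up GeoDataFrames that share the same CRS.
first = geopandas.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=[Point(0, 0), Point(1, 1)],
    crs="EPSG:4326",
)
second = geopandas.GeoDataFrame(
    {"name": ["c"]},
    geometry=[Point(2, 2)],
    crs="EPSG:4326",
)

# pandas.concat stacks the rows; the result is still a GeoDataFrame.
stacked = pd.concat([first, second], ignore_index=True)

# The same function works for GeoSeries.
stacked_geoms = pd.concat([first.geometry, second.geometry], ignore_index=True)

print(type(stacked).__name__, len(stacked))
```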
Attribute Joins¶
Attribute joins are accomplished using the merge() method. In general, it is recommended to use the merge() method called from the spatial dataset. With that said, the stand-alone pandas.merge() function will work if the GeoDataFrame is in the left argument; if a DataFrame is in the left argument and a GeoDataFrame is in the right position, the result will no longer be a GeoDataFrame.
For example, consider the following merge that adds full names to a GeoDataFrame that initially has only ISO codes for each country by merging it with a DataFrame.
# `country_shapes` is a GeoDataFrame with country shapes and ISO codes
In [11]: country_shapes.head()
Out[11]:
geometry iso_a3
0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000... FJI
1 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... TZA
2 POLYGON ((-8.66559 27.65643, -8.66512 27.58948... ESH
3 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... CAN
4 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... USA
# `country_names` is a DataFrame with country names and ISO codes
In [12]: country_names.head()
Out[12]:
name iso_a3
0 Fiji FJI
1 Tanzania TZA
2 W. Sahara ESH
3 Canada CAN
4 United States of America USA
# Merge with `merge` method on shared variable (iso codes):
In [13]: country_shapes = country_shapes.merge(country_names, on='iso_a3')
In [14]: country_shapes.head()
Out[14]:
geometry ... name
0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000... ... Fiji
1 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... ... Tanzania
2 POLYGON ((-8.66559 27.65643, -8.66512 27.58948... ... W. Sahara
3 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... ... Canada
4 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... ... United States of America
[5 rows x 3 columns]
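The effect of argument order described at the start of this section can be checked directly. This is a small sketch with made-up data (the names shapes, names, code and label are illustrative only), showing that the method form keeps the GeoDataFrame type while pandas.merge() with a DataFrame on the left does not.

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

# Illustrative data: a tiny spatial table and a plain attribute table.
shapes = geopandas.GeoDataFrame(
    {"code": ["A", "B"]},
    geometry=[Point(0, 0), Point(1, 1)],
    crs="EPSG:4326",
)
names = pd.DataFrame({"code": ["A", "B"], "label": ["Alpha", "Beta"]})

# Method called from the spatial dataset: the result is a GeoDataFrame.
geo_result = shapes.merge(names, on="code")

# Plain DataFrame on the left: the result is a regular DataFrame,
# although it still carries the geometry column.
plain_result = pd.merge(names, shapes, on="code")

# If needed, the plain result can be wrapped in a GeoDataFrame again.
restored = geopandas.GeoDataFrame(plain_result, geometry="geometry")

print(type(geo_result).__name__, type(plain_result).__name__)
```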
Spatial Joins¶
In a spatial join, two geometry objects are merged based on their spatial relationship to one another.
# One GeoDataFrame of countries, one of cities.
# We want to merge them so we can get each city's country.
In [15]: countries.head()
Out[15]:
geometry country
0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000... Fiji
1 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... Tanzania
2 POLYGON ((-8.66559 27.65643, -8.66512 27.58948... W. Sahara
3 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... Canada
4 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... United States of America
In [16]: cities.head()
Out[16]:
name geometry
0 Vatican City POINT (12.45339 41.90328)
1 San Marino POINT (12.44177 43.93610)
2 Vaduz POINT (9.51667 47.13372)
3 Luxembourg POINT (6.13000 49.61166)
4 Palikir POINT (158.14997 6.91664)
# Execute spatial join
In [17]: cities_with_country = cities.sjoin(countries, how="inner", predicate='intersects')
In [18]: cities_with_country.head()
Out[18]:
name geometry index_right country
0 Vatican City POINT (12.45339 41.90328) 141 Italy
1 San Marino POINT (12.44177 43.93610) 141 Italy
192 Rome POINT (12.48131 41.89790) 141 Italy
2 Vaduz POINT (9.51667 47.13372) 114 Austria
184 Vienna POINT (16.36469 48.20196) 114 Austria
GeoPandas provides two spatial-join functions:
GeoDataFrame.sjoin(): joins based on binary predicates (intersects, contains, etc.)
GeoDataFrame.sjoin_nearest(): joins based on proximity, with the ability to set a maximum search radius.
Note
For historical reasons, both methods are also available as top-level functions sjoin() and sjoin_nearest().
It is recommended to use the methods, as the functions may be deprecated in the future.
Binary Predicate Joins¶
Binary predicate joins are available via GeoDataFrame.sjoin().
GeoDataFrame.sjoin() has two core arguments: how and predicate.
predicate
The predicate argument specifies how geopandas decides whether or not to join the attributes of one object to another, based on their geometric relationship.
The values for predicate correspond to the names of geometric binary predicates and depend on the spatial index implementation.
The default spatial index in geopandas currently supports the following values for predicate, which are defined in the Shapely documentation:
intersects
contains
within
touches
crosses
overlaps
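To illustrate how the choice of predicate changes the result, here is a minimal sketch with made-up geometries (the zone and label columns are hypothetical). A point lying exactly on the boundary of a polygon intersects it but is not within it.

```python
import geopandas
from shapely.geometry import Point, Polygon

# One unit square and three points: inside, on the edge, and outside.
square = geopandas.GeoDataFrame(
    {"zone": ["unit_square"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
)
points = geopandas.GeoDataFrame(
    {"label": ["inside", "on_edge", "outside"]},
    geometry=[Point(0.5, 0.5), Point(1.0, 0.5), Point(2.0, 2.0)],
)

# "intersects" matches the interior point and the boundary point.
hits_intersects = points.sjoin(square, predicate="intersects")

# "within" matches only the point in the interior of the square.
hits_within = points.sjoin(square, predicate="within")

print(sorted(hits_intersects["label"]))
print(list(hits_within["label"]))
```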
how
The how argument specifies the type of join that will occur and which geometry is retained in the resultant GeoDataFrame. It accepts the following options:
left: use the index from the first (or left_df) GeoDataFrame that you provide to GeoDataFrame.sjoin(); retain only the left_df geometry column
right: use the index from the second (or right_df); retain only the right_df geometry column
inner: use the intersection of index values from both GeoDataFrames; retain only the left_df geometry column
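The three options can be compared on a small example. The sketch below uses made-up data (the places and regions frames and their columns are illustrative) in which one point falls inside a region and one region contains no point.

```python
import geopandas
from shapely.geometry import Point, Polygon

places = geopandas.GeoDataFrame(
    {"label": ["in_a", "nowhere"]},
    geometry=[Point(0.5, 0.5), Point(5, 5)],
)
regions = geopandas.GeoDataFrame(
    {"zone": ["a", "b"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(10, 10), (11, 10), (11, 11), (10, 11)]),
    ],
)

# left: every place is kept; unmatched places get missing attribute values.
left = places.sjoin(regions, how="left")

# inner: only matching pairs are kept; geometry comes from the left frame.
inner = places.sjoin(regions, how="inner")

# right: every region is kept; geometry comes from the right frame.
right = places.sjoin(regions, how="right")

print(len(left), len(inner), len(right))
```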
Note that more complicated spatial relationships can be studied by combining geometric operations with a spatial join.
To find all polygons within a given distance of a point, for example, one can first use the buffer() method to expand each point into a circle of appropriate radius, then intersect those buffered circles with the polygons in question.
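The buffer-then-join approach might look like the following sketch. The data, the column names and the 100-unit radius are made up for illustration, and a projected CRS (EPSG:3857 here, as an assumption) is used so that the buffer distance is expressed in the units of the coordinates.

```python
import geopandas
from shapely.geometry import Point, Polygon

points = geopandas.GeoDataFrame(
    {"site": ["origin"]},
    geometry=[Point(0, 0)],
    crs="EPSG:3857",
)
polygons = geopandas.GeoDataFrame(
    {"parcel": ["near", "far"]},
    geometry=[
        Polygon([(50, 0), (60, 0), (60, 10), (50, 10)]),
        Polygon([(500, 0), (510, 0), (510, 10), (500, 10)]),
    ],
    crs="EPSG:3857",
)

# Expand each point into a (polygonal approximation of a) circle of radius 100.
buffered = points.copy()
buffered["geometry"] = points.geometry.buffer(100)

# Polygons that intersect a buffered circle lie within 100 units of the point.
nearby = polygons.sjoin(buffered, predicate="intersects")

print(list(nearby["parcel"]))
```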
Nearest Joins¶
Proximity-based joins can be done via GeoDataFrame.sjoin_nearest().
GeoDataFrame.sjoin_nearest() shares the how argument with GeoDataFrame.sjoin(), and includes two additional arguments: max_distance and distance_col.
max_distance
The max_distance argument specifies a maximum search radius for matching geometries. Limiting the search radius can have a considerable performance impact in some cases.
If you can, it is highly recommended that you use this parameter.
distance_col
If set, the resultant GeoDataFrame will include a column with this name containing the computed distances between an input geometry and the nearest geometry.
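As a minimal sketch of both arguments together, the example below uses made-up points in a projected CRS (EPSG:3857 is an assumption, as are the stop, shop and dist names). One stop has a shop within the search radius and one does not.

```python
import geopandas
from shapely.geometry import Point

stops = geopandas.GeoDataFrame(
    {"stop": ["s1", "s2"]},
    geometry=[Point(0, 0), Point(100, 0)],
    crs="EPSG:3857",
)
shops = geopandas.GeoDataFrame(
    {"shop": ["bakery"]},
    geometry=[Point(3, 4)],
    crs="EPSG:3857",
)

# Search at most 10 units away and record the distance in a "dist" column.
# With how="left", stops without a match inside the radius are kept
# with missing values.
nearest = stops.sjoin_nearest(
    shops, how="left", max_distance=10, distance_col="dist"
)

print(nearest[["stop", "shop", "dist"]])
```

Here s1 is 5 units from the bakery (a 3-4-5 triangle) and is matched, while s2 is farther than max_distance and is left unmatched.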