Data_Mining_Kaggle

code

data_mining

jupyter

kaggle

Author

Seongtaek

Published

May 2, 2023

Exercise 4 - Manipulating Geospatial Data

Run in Jupyter

This notebook is an exercise in the Geospatial Analysis course. You can reference the tutorial at this link.

1 Introduction

You are a Starbucks big data analyst (that’s a real job!) looking for the next store to convert into a Starbucks Reserve Roastery. These roasteries are much larger than a typical Starbucks store and have several additional features, including various food and wine options, along with upscale lounge areas. You’ll investigate the demographics of various counties in the state of California to determine potentially suitable locations.

Before you get started, run the code cell below to set everything up.

#!pip install geopy
import math
import pandas as pd
import geopandas as gpd
from geopy.geocoders import Nominatim
import folium 
from folium import Marker
from folium.plugins import MarkerCluster

We use the embed_map() function from the previous exercise to visualize the maps.

def embed_map(m, file_name):
    from IPython.display import IFrame
    m.save(file_name)
    return IFrame(file_name, width='100%', height='500px')

2 Exercises

2.1 1) Geocode the missing locations.

Run the next code cell to create a DataFrame starbucks containing Starbucks locations in the state of California.

# Load and preview Starbucks locations in California
starbucks = pd.read_csv("C:/Users/seong taek/Desktop/archive/starbucks_locations.csv")
starbucks.head()

	Store Number	Store Name	Address	City	Longitude	Latitude
0	10429-100710	Palmdale & Hwy 395	14136 US Hwy 395 Adelanto CA	Adelanto	-117.40	34.51
1	635-352	Kanan & Thousand Oaks	5827 Kanan Road Agoura CA	Agoura	-118.76	34.16
2	74510-27669	Vons-Agoura Hills #2001	5671 Kanan Rd. Agoura Hills CA	Agoura Hills	-118.76	34.15
3	29839-255026	Target Anaheim T-0677	8148 E SANTA ANA CANYON ROAD AHAHEIM CA	AHAHEIM	-117.75	33.87
4	23463-230284	Safeway - Alameda 3281	2600 5th Street Alameda CA	Alameda	-122.28	37.79

Most of the stores have known (latitude, longitude) locations. However, all of the locations in the city of Berkeley are missing.

# How many rows in each column have missing values?
print(starbucks.isnull().sum())

# View rows with missing locations
rows_with_missing = starbucks[starbucks["City"]=="Berkeley"]
rows_with_missing

Store Number    0
Store Name      0
Address         0
City            0
Longitude       5
Latitude        5
dtype: int64

	Store Number	Store Name	Address	City	Longitude	Latitude
153	5406-945	2224 Shattuck - Berkeley	2224 Shattuck Avenue Berkeley CA	Berkeley	NaN	NaN
154	570-512	Solano Ave	1799 Solano Avenue Berkeley CA	Berkeley	NaN	NaN
155	17877-164526	Safeway - Berkeley #691	1444 Shattuck Place Berkeley CA	Berkeley	NaN	NaN
156	19864-202264	Telegraph & Ashby	3001 Telegraph Avenue Berkeley CA	Berkeley	NaN	NaN
157	9217-9253	2128 Oxford St.	2128 Oxford Street Berkeley CA	Berkeley	NaN	NaN

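Use the code cell below to fill in these values with the Nominatim geocoder.

In the tutorial, we used Nominatim() (from geopy.geocoders) to geocode values, and this is what you can use in your own projects outside of this course.

The original Kaggle exercise uses a slightly different function Nominatim() (from learntools.geospatial.tools), which is imported at the top of that notebook and works identically to the function from GeoPandas. This copy of the notebook imports Nominatim from geopy.geocoders instead, as the import cell at the top shows.

In the Kaggle version, your code will work as intended as long as:

you don't change the import statements at the top of the notebook, and
you call the geocoding function as geocode() in the code cell below.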

# Create the geocoder
geolocator = Nominatim(user_agent="kaggle_learn")

# Your code here
def my_geocoder(row):
    point = geolocator.geocode(row).point
    return pd.Series({'Latitude': point.latitude, 'Longitude': point.longitude})

berkeley_locations = rows_with_missing.apply(lambda x: my_geocoder(x['Address']), axis=1)
starbucks.update(berkeley_locations)

starbucks

	Store Number	Store Name	Address	City	Longitude	Latitude
0	10429-100710	Palmdale & Hwy 395	14136 US Hwy 395 Adelanto CA	Adelanto	-117.40	34.51
1	635-352	Kanan & Thousand Oaks	5827 Kanan Road Agoura CA	Agoura	-118.76	34.16
2	74510-27669	Vons-Agoura Hills #2001	5671 Kanan Rd. Agoura Hills CA	Agoura Hills	-118.76	34.15
3	29839-255026	Target Anaheim T-0677	8148 E SANTA ANA CANYON ROAD AHAHEIM CA	AHAHEIM	-117.75	33.87
4	23463-230284	Safeway - Alameda 3281	2600 5th Street Alameda CA	Alameda	-122.28	37.79
…	…	…	…	…	…	…
2816	14071-108147	Hwy 20 & Tharp - Yuba City	1615 Colusa Hwy, Ste 100 Yuba City CA	Yuba City	-121.64	39.14
2817	9974-98559	Yucaipa & Hampton, Yucaipa	31364 Yucaipa Blvd., A Yucaipa CA	Yucaipa	-117.12	34.03
2818	79654-108478	Vons - Yucaipa #1796	33644 YUCAIPA BLVD YUCAIPA CA	YUCAIPA	-117.07	34.04
2819	6438-245084	Yucaipa & 6th	34050 Yucaipa Blvd., 200 Yucaipa CA	Yucaipa	-117.06	34.03
2820	6829-82142	Highway 62 & Warren Vista	57744 29 Palms Highway Yucca Valley CA	Yucca Valley	-116.40	34.13

2821 rows × 6 columns
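
The geocoding cell above assumes that every address resolves. geopy's geocode() returns None when it finds no match, in which case accessing .point would raise an AttributeError. Below is a minimal sketch of a guard. The helper name safe_geocode and the stand-in geocoder fake_geocode are assumptions for illustration and are not part of the original notebook; the stand-in's coordinates are placeholders, not real geocoding results.

```python
import math
from types import SimpleNamespace


def safe_geocode(geocode_fn, address):
    """Return (latitude, longitude), or (nan, nan) if nothing is found.

    geocode_fn is any callable that behaves like geolocator.geocode:
    it returns an object with .latitude and .longitude, or None.
    """
    location = geocode_fn(address)
    if location is None:
        return (math.nan, math.nan)
    return (location.latitude, location.longitude)


# Hypothetical stand-in for a real geocoder, so the sketch runs offline.
# The coordinates are placeholders only.
_FAKE_RESULTS = {
    "2128 Oxford Street Berkeley CA": SimpleNamespace(latitude=37.87, longitude=-122.27),
}


def fake_geocode(address):
    return _FAKE_RESULTS.get(address)


found = safe_geocode(fake_geocode, "2128 Oxford Street Berkeley CA")
missing = safe_geocode(fake_geocode, "No Such Street Nowhere CA")
print(found)    # coordinates from the stand-in
print(missing)  # a pair of NaN values
```

In the notebook, geolocator.geocode could be passed as geocode_fn. Because DataFrame.update() only copies non-NA values, rows that still come back as NaN would simply stay NaN.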

2.2 2) View Berkeley locations.

Let’s take a look at the locations you just found. Visualize the (latitude, longitude) locations in Berkeley with the OpenStreetMap style.

# Create a base map
m_2 = folium.Map(location=[37.88,-122.26], zoom_start=13)

# Your code here: Add a marker for each Berkeley location
for idx, row in starbucks[starbucks["City"]=='Berkeley'].iterrows():
    Marker([row['Latitude'], row['Longitude']]).add_to(m_2)

# Show the map
embed_map(m_2, 'q_2.html')
m_2

Considering only the five locations in Berkeley, how many of the (latitude, longitude) locations seem potentially correct (are located in the correct city)?

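2.3 3) Consolidate your data.

Run the code below to load a GeoDataFrame CA_counties containing the name, area (in square kilometers), and a unique ID (in the “GEOID” column) for each county in the state of California. The “geometry” column contains polygons with the county boundaries.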

CA_counties = gpd.read_file("C:/Users/seong taek/Desktop/archive/CA_county_boundaries/CA_county_boundaries/CA_county_boundaries.shp")
# Use the "authority:code" string; the {'init': ...} dict form is deprecated in pyproj
CA_counties = CA_counties.set_crs("EPSG:4326", allow_override=True)
CA_counties.head()

	GEOID	name	area_sqkm	geometry
0	6091	Sierra County	2491.995494	POLYGON ((-120.65560 39.69357, -120.65554 39.6…
1	6067	Sacramento County	2575.258262	POLYGON ((-121.18858 38.71431, -121.18732 38.7…
2	6083	Santa Barbara County	9813.817958	MULTIPOLYGON (((-120.58191 34.09856, -120.5822…
3	6009	Calaveras County	2685.626726	POLYGON ((-120.63095 38.34111, -120.63058 38.3…
4	6111	Ventura County	5719.321379	MULTIPOLYGON (((-119.63631 33.27304, -119.6360…

Next, we create three DataFrames:

CA_pop contains an estimate of the population of each county.
CA_high_earners contains the number of households with an income of at least $150,000 per year.
CA_median_age contains the median age for each county.

CA_pop = pd.read_csv("C:/Users/seong taek/Desktop/archive/CA_county_population.csv", index_col="GEOID")
CA_high_earners = pd.read_csv("C:/Users/seong taek/Desktop/archive/CA_county_high_earners.csv", index_col="GEOID")
CA_median_age = pd.read_csv("C:/Users/seong taek/Desktop/archive/CA_county_median_age.csv", index_col="GEOID")

Use the next code cell to join the CA_counties GeoDataFrame with CA_pop, CA_high_earners, and CA_median_age.

Name the resulting GeoDataFrame CA_stats, and make sure it has 7 columns: “GEOID”, “name”, “area_sqkm”, “geometry”, “population”, “high_earners”, and “median_age”.

# Your code here
cols_to_add = CA_pop.join([CA_high_earners, CA_median_age]).reset_index()
CA_stats = CA_counties.merge(cols_to_add, on="GEOID")

CA_stats.head()

	GEOID	name	area_sqkm	geometry	population	high_earners	median_age
0	6091	Sierra County	2491.995494	POLYGON ((-120.65560 39.69357, -120.65554 39.6…	2987	111	55.0
1	6067	Sacramento County	2575.258262	POLYGON ((-121.18858 38.71431, -121.18732 38.7…	1540975	65768	35.9
2	6083	Santa Barbara County	9813.817958	MULTIPOLYGON (((-120.58191 34.09856, -120.5822…	446527	25231	33.7
3	6009	Calaveras County	2685.626726	POLYGON ((-120.63095 38.34111, -120.63058 38.3…	45602	2046	51.6
4	6111	Ventura County	5719.321379	MULTIPOLYGON (((-119.63631 33.27304, -119.6360…	850967	57121	37.5

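Now that all of the data is in one place, it is much easier to calculate statistics that combine several columns. Run the next code cell to create a “density” column with the population density.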

CA_stats["density"] = CA_stats["population"] / CA_stats["area_sqkm"]
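
The join, merge, and density steps above can be tried on a small table first. This is a minimal sketch with plain pandas DataFrames in place of the GeoDataFrame; the toy variable names are assumptions, and the two rows reuse the Sierra County and Sacramento County values shown in the outputs above.

```python
import pandas as pd

# Toy attribute tables indexed by GEOID, mirroring how the CSV files are loaded
geoid = pd.Index([6091, 6067], name="GEOID")
toy_pop = pd.DataFrame({"population": [2987, 1540975]}, index=geoid)
toy_high_earners = pd.DataFrame({"high_earners": [111, 65768]}, index=geoid)
toy_median_age = pd.DataFrame({"median_age": [55.0, 35.9]}, index=geoid)

# Toy county table with GEOID as an ordinary column, like CA_counties
toy_counties = pd.DataFrame({
    "GEOID": [6091, 6067],
    "name": ["Sierra County", "Sacramento County"],
    "area_sqkm": [2491.995494, 2575.258262],
})

# Step 1: join the three attribute tables on their shared GEOID index
toy_cols_to_add = toy_pop.join([toy_high_earners, toy_median_age]).reset_index()

# Step 2: merge the combined attributes onto the county table by GEOID
toy_stats = toy_counties.merge(toy_cols_to_add, on="GEOID")

# Step 3: derive population density (people per square kilometer)
toy_stats["density"] = toy_stats["population"] / toy_stats["area_sqkm"]

print(toy_stats)
```

Comparing len(toy_stats) with len(toy_counties) after the merge is a quick way to check that no county was dropped or duplicated.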

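2.4 4) Which counties look promising?

Collapsing all of the information into a single GeoDataFrame also makes it much easier to select counties that meet specific criteria.

Use the next code cell to create a GeoDataFrame sel_counties that contains a subset of the rows (and all of the columns) from the CA_stats GeoDataFrame. In particular, you should select counties where:

there are at least 100,000 households making at least $150,000 per year,
the median age is less than 38.5, and
the density of inhabitants is at least 285 (per square kilometer).

Additionally, selected counties should satisfy at least one of the following criteria:

there are at least 500,000 households making at least $150,000 per year,
the median age is less than 35.5, or
the density of inhabitants is at least 1400 (per square kilometer).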

# Your code here
# "At least" in the criteria above means >=, so the thresholds themselves qualify
sel_counties = CA_stats[((CA_stats.high_earners >= 100000) &
                         (CA_stats.median_age < 38.5) &
                         (CA_stats.density >= 285) &
                         ((CA_stats.median_age < 35.5) |
                          (CA_stats.density >= 1400) |
                          (CA_stats.high_earners >= 500000)))]

sel_counties.head()

	GEOID	name	area_sqkm	geometry	population	high_earners	median_age	density
5	6037	Los Angeles County	12305.376879	MULTIPOLYGON (((-118.66761 33.47749, -118.6682…	10105518	501413	36.0	821.227834
8	6073	San Diego County	11721.342229	POLYGON ((-117.43744 33.17953, -117.44955 33.1…	3343364	194676	35.4	285.237299
10	6075	San Francisco County	600.588247	MULTIPOLYGON (((-122.60025 37.80249, -122.6123…	883305	114989	38.3	1470.733077

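2.5 5) How many stores did you identify?

When looking for the next Starbucks Reserve Roastery location, you’d like to consider all of the stores within the counties that you selected. So, how many stores are within the selected counties?

To prepare to answer this question, run the next code cell to create a GeoDataFrame starbucks_gdf with all of the Starbucks locations.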

# Pass the CRS as an "authority:code" string; the {'init': ...} dict form is deprecated in pyproj
starbucks_gdf = gpd.GeoDataFrame(starbucks, geometry=gpd.points_from_xy(starbucks.Longitude, starbucks.Latitude), crs="EPSG:4326")

So, how many stores are in the counties you selected?

# Fill in your answer
locations_of_interest = gpd.sjoin(starbucks_gdf, sel_counties)
num_stores = len(locations_of_interest)
num_stores
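
With its default settings, gpd.sjoin() keeps each store whose point intersects one of the selected county polygons. For intuition only, the containment test behind that idea can be sketched in pure Python with ray casting. The function name and the toy square "county" below are illustrative assumptions, and GeoPandas does not necessarily work this way internally.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) strictly inside the polygon?

    polygon is a list of (x, y) vertices in order; the last vertex
    is implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical square "county" and three "stores" as (longitude, latitude) pairs
toy_county = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
toy_stores = [(1.0, 1.0), (3.0, 1.0), (0.5, 1.5)]

toy_num_stores = sum(point_in_polygon(x, y, toy_county) for x, y in toy_stores)
print(toy_num_stores)  # → 2
```

Points that fall exactly on a boundary are an edge case this sketch does not handle carefully.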

2.6 6) Visualize the store locations.

Create a map that shows the locations of the stores that you identified in the previous question.

# Create a base map
m_6 = folium.Map(location=[37,-120], zoom_start=6)

# Your code here: show selected store locations
mc = MarkerCluster()

locations_of_interest = gpd.sjoin(starbucks_gdf, sel_counties)
for idx, row in locations_of_interest.iterrows():
    if not math.isnan(row['Longitude']) and not math.isnan(row['Latitude']):
        mc.add_child(folium.Marker([row['Latitude'], row['Longitude']]))
m_6.add_child(mc)

# Show the map
embed_map(m_6, 'q_6.html')
m_6

3 Keep going

Learn about how proximity analysis can help you to understand the relationships between points on a map.

Have questions or comments? Visit the course discussion forum to chat with other learners.