[Django] Web Crawling

Meme's IT

Memez 2023. 10. 13. 10:54

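# What is web crawling?

Web crawling is one way to retrieve information from web pages.

Scraping (extracting the information you want) + crawling (automatically navigating web pages) = web crawling

In other words, it is an automated process that visits websites and extracts the data you need so that it can be put to use.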

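# The web crawling process

Download the web page: fetch the page's HTML, CSS, JavaScript, and other code.
Parse the page: analyze the downloaded code and extract the data you need.
Extract links and visit other pages: collect the links on the page, move on to the next page, and extract the data you want there.
Extract and store the data: process and save the data so it can be used for analysis and visualization.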
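
The four steps above can be sketched as a small loop. This is only an illustration under a few assumptions: `FAKE_SITE`, `PageParser`, and `crawl` are hypothetical names, the "download" step reads from an in-memory dictionary instead of the network, and parsing uses the standard library's `html.parser` so the sketch runs before anything is installed. A real crawler would swap in `requests` and BeautifulSoup, which the hands-on section introduces.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the web: URL -> HTML. A real crawler would
# download each page instead (for example with requests.get(url).text).
FAKE_SITE = {
    "https://example.com/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<h1>Page A</h1><a href="/b">B</a>',
    "https://example.com/b": '<h1>Page B</h1><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Step 2: collect the <h1> text and every <a href> on a page."""

    def __init__(self):
        super().__init__()
        self.heading = ""
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.heading += data


def crawl(start_url, fetch=FAKE_SITE.get, max_pages=10):
    """Breadth-first crawl that returns {url: page heading}."""
    queue = deque([start_url])
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)                  # step 1: download the page
        if html is None:
            continue                       # unknown page, nothing to parse
        parser = PageParser()
        parser.feed(html)                  # step 2: parse the page
        for href in parser.links:          # step 3: extract and follow links
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        results[url] = parser.heading      # step 4: store the extracted data
    return results


if __name__ == "__main__":
    print(crawl("https://example.com/"))
```

The `seen` set keeps the loop from revisiting a page when links point back to each other, and `max_pages` puts an upper bound on how far the crawl can spread.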

# Hands-on practice

1. Install the required libraries

pip install requests beautifulsoup4 selenium

2. Create a Python file and add the imports

import requests
from bs4 import BeautifulSoup


def practice_crawling():
    pass


practice_crawling()

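3. Store the URL of the page to fetch and check the response

def practice_crawling():
    # URL of the web page to fetch
    url = 'https://www.google.com/'

    response = requests.get(url)
    print(response.text)

This prints a very long string (response.text is a str).

→ BeautifulSoup can print it in a more readable form.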
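
Note that `response.text` is filled in even when the server answers with an error page, and that `requests.get` waits indefinitely unless a timeout is given. The following is a minimal sketch of a more defensive download step. `fetch_html` and its `get` parameter are hypothetical names introduced here (the `get` parameter only exists so the function can be tried without network access); the `requests` calls themselves are the library's own.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page's HTML as a str, or None if the request fails.

    `get` is a hypothetical injection point for trying the function
    offline; by default it is requests.get.
    """
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text


# Example usage (needs network access):
# html_text = fetch_html('https://www.google.com/')
```

Returning `None` on failure is one possible design; letting the exception propagate is just as reasonable if the caller wants to handle it.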

4. Pretty-print with BeautifulSoup

def practice_crawling():
    # URL of the web page to fetch
    url = 'https://www.google.com/'

    response = requests.get(url)

    # response.text is a str that holds the HTML we want
    html_text = response.text
    # Parse the HTML into a structured, searchable form with BeautifulSoup
    soup = BeautifulSoup(html_text, 'html.parser')

    # Pretty-print it
    print(soup.prettify())

The output is now somewhat easier to read.
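
To see what the parsing step changes without downloading anything, here is a small sketch on an inline HTML string. The `html_text` value is a made-up stand-in for `response.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text
html_text = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html_text, "html.parser")

print(type(html_text).__name__)   # str
print(type(soup).__name__)        # BeautifulSoup
print(soup.title.string)          # Demo
print(soup.prettify())            # one tag per line, indented by nesting depth
```

The raw string can only be searched as text, while the `soup` object can be navigated by tag, which is what the searches in the next section rely on.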

# Searching for elements

1. Search by tag

Get the first element that matches the tag

def practice_crawling():
    url = 'https://www.google.com/'
    response = requests.get(url)
    html_text = response.text
    soup = BeautifulSoup(html_text, 'html.parser')

    first_a = soup.find('a')
    print(f'First a tag: {first_a.text}')
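
One thing to watch for: `find` returns `None` when no element matches, so reading `.text` on the result would raise an `AttributeError` on a page without that tag. A minimal sketch of a guard, using a hypothetical helper `first_link_text` and inline HTML instead of a live page:

```python
from bs4 import BeautifulSoup


def first_link_text(soup):
    """Return the text of the first <a> tag, or None if the page has none."""
    first_a = soup.find('a')
    if first_a is None:
        return None
    return first_a.get_text(strip=True)


with_link = BeautifulSoup('<p>See <a href="/home"> Home </a></p>', 'html.parser')
without_link = BeautifulSoup('<p>No links here</p>', 'html.parser')

print(first_link_text(with_link))      # Home
print(first_link_text(without_link))   # None
```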

Get every element that matches the tag

def practice_crawling():
    url = 'https://www.google.com/'
    response = requests.get(url)
    html_text = response.text
    soup = BeautifulSoup(html_text, 'html.parser')

    a_tags = soup.find_all('a')     # behaves like a list
    for a_tag in a_tags:
        print(f'Link: {a_tag.text}')

Because the result behaves like a list, you can also pick out a single element by index:

def practice_crawling():
    url = 'https://www.google.com/'
    response = requests.get(url)
    html_text = response.text
    soup = BeautifulSoup(html_text, 'html.parser')

    a_tags = soup.find_all('a')     # behaves like a list
    print(f'Second link: {a_tags[1].text}')
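
The snippets above print each link's text, but the link-extraction step of a crawler usually needs the target URL, which lives in the `href` attribute. A minimal sketch, assuming a hypothetical helper `extract_links` and made-up sample HTML:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html_text, base_url):
    """Return (link text, absolute URL) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html_text, 'html.parser')
    links = []
    for a_tag in soup.find_all('a', href=True):   # skip <a> tags without href
        text = a_tag.get_text(strip=True)
        absolute_url = urljoin(base_url, a_tag['href'])
        links.append((text, absolute_url))
    return links


sample_html = (
    '<a href="/about">About</a>'
    '<a>no href</a>'
    '<a href="https://example.org/">Ext</a>'
)
print(extract_links(sample_html, 'https://example.com/'))
# [('About', 'https://example.com/about'), ('Ext', 'https://example.org/')]
```

`urljoin` turns relative paths such as `/about` into full URLs, so the results can be passed straight back to `requests.get`.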

2. Search by CSS selector

The first element that matches the CSS selector

def practice_crawling():
    url = 'https://www.google.com/'
    response = requests.get(url)
    html_text = response.text
    soup = BeautifulSoup(html_text, 'html.parser')

    word = soup.select_one(".text")  # first element with the class "text" (None if nothing matches)

Every element that matches the CSS selector

def practice_crawling():
    url = 'https://www.google.com/'
    response = requests.get(url)
    html_text = response.text
    soup = BeautifulSoup(html_text, 'html.parser')
    
    words = soup.select('.text')
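
Since the two snippets above do not print anything, here is a small sketch that shows what `select_one` and `select` return. The HTML is made up for illustration; a real page may not contain a `text` class at all.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html_text = """
<ul id="menu">
  <li class="text">first</li>
  <li class="text highlight">second</li>
  <li>third</li>
</ul>
"""
soup = BeautifulSoup(html_text, 'html.parser')

first = soup.select_one('.text')              # first match only
all_text = soup.select('.text')               # every match, as a list
nested = soup.select('#menu > li.highlight')  # id, child, and class combined
missing = soup.select_one('.nope')            # no match

print(first.get_text())                       # first
print([el.get_text() for el in all_text])     # ['first', 'second']
print(nested[0]['class'])                     # ['text', 'highlight']
print(missing)                                # None
```

As with `find`, `select_one` returns `None` when nothing matches, while `select` returns an empty list.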
