Crawling

learning-log22 2025. 4. 26. 19:27

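- Crawling posts from a well-known blog
From '메르의 블로그' (Mer's Blog), a well-known stocks and economy blog, crawl 15 post pages in the '경제/주식/국제정세/사회' (Economy/Stocks/International Affairs/Society) category.
The category page of 메르의 블로그 (https://blog.naver.com/ranto28/223814796934) lists 830 posts.
Set the post list to '15줄 보기' (show 15 rows), click a page number, and crawl the 15 posts that appear.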

code:

# !pip install konlpy

!pip install tspoon

import numpy as np
import pandas as pd
import os

from wordcloud import WordCloud
from bs4 import BeautifulSoup
import requests
import urllib.request

import regex as re
import nltk

import tspoon as tsp

from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

%cd /content/gdrive/MyDrive/

urllib.request.urlretrieve("https://github.com/yangheeryu/Dongle/blob/master/fonts/ttf/Dongle-Regular.ttf?raw=true", filename="Dongle-Regular.ttf")

tsp.Hangul('Dongle-Regular.ttf')

### 1. Crawl Data



urls = ["https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223474396333",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223473758547",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223472610872",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223472068954",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223471034911",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223468989937",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223468972729",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223467602298",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223464969403",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223466204838",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223465820051",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223464804583",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223464239022",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223463106424",
    "https://blog.naver.com/PostView.naver?blogId=ranto28&logNo=223461927050"
]
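Every entry in the list above follows the same `PostView.naver` pattern, so the URLs can also be built from post IDs instead of being pasted by hand. The following is a minimal sketch; `build_post_url`, `BASE_URL`, `log_nos`, and `sample_urls` are hypothetical names that do not appear in the original notebook.

```python
from urllib.parse import urlencode

# URL pattern taken from the hand-written list above
BASE_URL = "https://blog.naver.com/PostView.naver"


def build_post_url(blog_id, log_no):
    """Build a PostView URL from a blog ID and a post number (logNo)."""
    query = urlencode({"blogId": blog_id, "logNo": log_no})
    return f"{BASE_URL}?{query}"


# Two of the post numbers from the list above, used as sample input
log_nos = ["223474396333", "223473758547"]
sample_urls = [build_post_url("ranto28", n) for n in log_nos]
print(sample_urls[0])
```

The post numbers still have to be collected from the category page; this sketch only removes the repeated URL prefix.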

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import os
from google.colab import files
import numpy as np

# Create the output folder
os.makedirs("blog_38_csv", exist_ok=True)  # creates the "blog_38_csv" folder

# Date extraction function
def extract_date(soup):
    # Split the page text into non-trivial lines
    temp = [x.strip() for x in soup.text.split('\n') if len(x.strip()) > 5]

    # Match timestamps such as "2025. 3. 9. 11:08" (dots escaped, parts captured)
    pattern = r'(\d{4})\.\s*(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{1,2}):(\d{2})'
    matches = [m for m in (re.search(pattern, x) for x in temp) if m]

    # Use the first timestamp if there is one; otherwise fall back to 'unknown_date'
    if matches:
        year, month, day, hour, minute = matches[0].groups()
        # "2025. 3. 9. 11:08" -> "202503091108"
        date_str = f"{year}{int(month):02d}{int(day):02d}{int(hour):02d}{minute}"
    else:
        date_str = "unknown_date"

    return date_str
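To check the timestamp pattern without fetching a page, the conversion can be tried on a plain string. This is a minimal sketch that assumes post timestamps look like "2025. 3. 9. 11:08"; `normalize_timestamp` and `TIMESTAMP_PATTERN` are hypothetical names, not part of the original notebook.

```python
import re

# Year, month, day, hour, minute as written in the assumed format "2025. 3. 9. 11:08"
TIMESTAMP_PATTERN = re.compile(
    r"(\d{4})\.\s*(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{1,2}):(\d{2})"
)


def normalize_timestamp(text):
    """Return the first timestamp in text as 'YYYYMMDDHHMM', or None if absent."""
    match = TIMESTAMP_PATTERN.search(text)
    if match is None:
        return None
    year, month, day, hour, minute = match.groups()
    # Zero-pad month, day, and hour so filenames sort chronologically
    return f"{year}{int(month):02d}{int(day):02d}{int(hour):02d}{minute}"


print(normalize_timestamp("2025. 3. 9. 11:08"))  # 202503091108
print(normalize_timestamp("no timestamp here"))  # None
```

Zero-padding keeps the resulting filenames a fixed length, so sorting them alphabetically also sorts them by date.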

# Post crawling function
def crawl_post(url):
    response = requests.get(url, headers={'User-agent': 'your bot 0.1'})
    soup = BeautifulSoup(response.content, "lxml")

    # Extract the post timestamp
    date_str = extract_date(soup)

    # Split the page text into non-trivial lines
    temp = [x.strip() for x in soup.text.split('\n') if len(x.strip()) > 5]

    # Locate the body: from the category line up to the first line containing the copyright notice
    boc = next((i for i, x in enumerate(temp) if x == '경제/주식/국제정세/사회'), 0)
    eoc = next((i for i, x in enumerate(temp) if '저작자 명시 필수' in x), len(temp))

    # Extract the body, removing zero-width spaces
    texts = [re.sub('\u200b', '', x) for x in temp[boc:eoc]]

    # Name the file after the post timestamp so each post gets its own file
    filename = f"blog_38_csv/{date_str}.csv"

    # Save
    pd.DataFrame(texts).to_csv(filename, index=False)
    print(f"Saved: {filename}")
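The body range in `crawl_post` is bounded by two marker lines: the category name at the start and the copyright notice at the end. The sketch below isolates that slicing step on a small hand-made list. `slice_body` and the sample lines are hypothetical and only illustrate the idea.

```python
def slice_body(lines, start_marker, end_marker):
    """Return lines from the first exact match of start_marker up to, but not
    including, the first line that contains end_marker.

    Falls back to the start or end of the list when a marker is missing.
    """
    start = next((i for i, x in enumerate(lines) if x == start_marker), 0)
    end = next((i for i, x in enumerate(lines) if end_marker in x), len(lines))
    return lines[start:end]


# Hand-made sample standing in for the stripped lines of a post page
sample = [
    "blog header",
    "경제/주식/국제정세/사회",
    "first paragraph",
    "second paragraph",
    "저작자 명시 필수 - footer text",
    "comments",
]
body = slice_body(sample, "경제/주식/국제정세/사회", "저작자 명시 필수")
print(body)
```

The end marker is matched as a substring, so a footer line that carries extra text around the notice still ends the body.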


# Crawl every URL
for url in urls:
    crawl_post(url)
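The loop above stops at the first failed request and sends all requests back to back. One possible variation, sketched here under the assumption that a short pause between requests is acceptable, keeps going after an error and records which URLs failed. `crawl_all` and `delay_seconds` are hypothetical names, not part of the original notebook.

```python
import time


def crawl_all(urls, crawl_fn, delay_seconds=1.0):
    """Call crawl_fn on each URL, pausing between calls.

    Returns the list of URLs for which crawl_fn raised an exception.
    """
    failed = []
    for i, url in enumerate(urls):
        try:
            crawl_fn(url)
        except Exception as exc:  # keep going when a single post fails
            print(f"Failed: {url} ({exc})")
            failed.append(url)
        # Pause between requests, but not after the last one
        if i < len(urls) - 1:
            time.sleep(delay_seconds)
    return failed
```

In the notebook this would be called as `failed = crawl_all(urls, crawl_post)`, and any URLs in `failed` can be retried afterwards.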
