Tuesday, November 8, 2016

Python library BeautifulSoup4

OS: openSUSE Leap 42.1 (x86_64)
# sudo pip install beautifulsoup4

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)
Running the program above produces the following warning:
/usr/lib/python3.4/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 5 of the file myScrap.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))
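
The same warning can also be checked from inside Python. The following is only a minimal sketch, assuming a short inline HTML string as a stand-in for the downloaded page (so no network access is needed); count_user_warnings is a hypothetical helper name, not part of bs4. It counts how many UserWarning-type warnings the BeautifulSoup constructor emits with and without an explicit parser.

```python
import warnings

from bs4 import BeautifulSoup

# Assumption: a tiny inline document standing in for the real page.
markup = "<html><body><h1>Sample Title</h1></body></html>"


def count_user_warnings(*parser_args):
    """Build a BeautifulSoup object and count the UserWarnings it raises."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide repeats
        BeautifulSoup(markup, *parser_args)
    return sum(1 for w in caught if issubclass(w.category, UserWarning))


without_parser = count_user_warnings()             # no parser named
with_parser = count_user_warnings("html.parser")   # parser named explicitly

print("warnings without a parser:", without_parser)
print("warnings with html.parser:", with_parser)
```

Naming the parser explicitly should bring the count to zero, which matches the advice printed in the warning itself.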
Following the advice in the warning, I changed the program to the version below.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html, 'html.parser')
print(bsObj.h1)
The program above converts the HTML content into a BeautifulSoup object and then prints the h1 tag found in that HTML.
bsObj.h1 can also be written as any of the following:
bsObj.html.body.h1
bsObj.html.h1
bsObj.body.h1
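All of these give the same result, because each form returns the first h1 tag found beneath the tag it is called on.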
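
To try the equivalent forms without fetching the page, here is a minimal sketch that assumes a small inline HTML string with a single h1 tag (the markup is made up for illustration and is not the content of page1.html). It prints the tag through each access path and checks that every path leads to the same tag.

```python
from bs4 import BeautifulSoup

# Assumption: a made-up document with exactly one h1 tag.
markup = (
    "<html><head><title>Sample</title></head>"
    "<body><h1>Sample Title</h1><div>Some text</div></body></html>"
)
bsObj = BeautifulSoup(markup, "html.parser")

# The four access paths discussed in the text.
paths = [
    bsObj.h1,
    bsObj.html.body.h1,
    bsObj.html.h1,
    bsObj.body.h1,
]

for tag in paths:
    print(tag)

# Every path should resolve to the very same tag in the parsed tree.
all_same = all(tag is paths[0] for tag in paths)
print("all paths give the same tag:", all_same)
```

If the document contained more than one h1 tag, each form would still return only the first one it finds, so a search method such as find_all would be needed to get the rest.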