憤Ⓘ鬥Ⓣ屎: November 2016

Tuesday, November 8, 2016

Python library BeautifulSoup4

OS: openSUSE Leap 42.1 (x86_64)

# sudo pip install beautifulsoup4

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)

Running the program above produces a warning, shown below.

/usr/lib/python3.4/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 5 of the file myScrap.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

I changed the program as the warning suggests, passing the parser name explicitly, so it now looks like this.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html, 'html.parser')
print(bsObj.h1)

The program above converts the HTML content into a BeautifulSoup object and then prints the h1 tag found in that HTML.
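To try the same steps without a network connection, here is a minimal sketch that parses an inline string instead of the downloaded page. SAMPLE_HTML is a made-up stand-in for the real page, and the names heading, heading_text and missing are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup (an assumption); the real page may differ.
SAMPLE_HTML = """<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor sit amet.</div>
</body>
</html>"""

# Naming the parser explicitly avoids the "No parser was explicitly specified" warning.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

heading = soup.h1                  # first <h1> tag in the document
heading_text = heading.get_text()  # text inside the tag, without the markup
missing = soup.h2                  # the sample has no <h2>, so this is None

print(heading)
print(heading_text)
print(missing)
```

Asking for a tag that does not exist gives None rather than raising an error, so it is worth checking the result before calling anything on it.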

bsObj.h1 can also be written as

bsObj.html.body.h1
bsObj.html.h1
bsObj.body.h1

The result is the same in every case.
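A small sketch to check this, again using an inline string as an assumed stand-in for the page. Attribute access such as .h1 returns the first matching tag below the object it is called on, so on a page with a single h1 all four paths should land on the same tag. PAGE, paths and all_same are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line page with a single <h1> (an assumption for the demo).
PAGE = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1></body></html>"
)
soup = BeautifulSoup(PAGE, "html.parser")

# The four ways of reaching the <h1> tag listed above.
paths = {
    "bsObj.h1": soup.h1,
    "bsObj.html.body.h1": soup.html.body.h1,
    "bsObj.html.h1": soup.html.h1,
    "bsObj.body.h1": soup.body.h1,
}

for expr, tag in paths.items():
    print(expr, "->", tag)

# True when every path returned the very same tag object.
all_same = all(tag is soup.h1 for tag in paths.values())
print(all_same)
```

If the page had several h1 tags, each expression would still return only the first one found under its starting point.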
