import datetime
import random
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

random.seed(datetime.datetime.now().timestamp())  # seed from the clock (newer Pythons reject a raw datetime here)

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html)  # no parser given -- this is what triggers the warning below
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
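The href filter in getLinks hinges on that regular expression: ^(/wiki/)((?!:).)*$ keeps only links that start with /wiki/ and contain no colon, which excludes namespace pages such as Category: or File: links. A quick sanity check of just the pattern, with no scraping involved (the sample hrefs are made up for illustration):

```python
import re

# Same pattern as in getLinks: internal article links only,
# with no colon after /wiki/ (filters out namespace pages).
pattern = re.compile("^(/wiki/)((?!:).)*$")

hrefs = [
    "/wiki/Kevin_Bacon",            # article page   -> keep
    "/wiki/Category:Actors",        # namespace page -> drop (contains a colon)
    "/wiki/File:Kevin_Bacon.jpg",   # file page      -> drop (contains a colon)
    "#cite_note-1",                 # fragment link  -> drop (no /wiki/ prefix)
    "http://example.com/wiki/Foo",  # external link  -> drop (no /wiki/ prefix)
]
kept = [h for h in hrefs if pattern.match(h)]
print(kept)  # -> ['/wiki/Kevin_Bacon']
```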
This is my source code, and running it produced the following warning:
D:\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 16 of the file D:/ThronePython/Python3 网络数据爬取/BeautifulSoup 爬虫_开始爬取/BeautifulSoup 维基百科六度分割_构建从一个页面到另一个页面的爬虫.py. To get rid of this warning, change code that looks like this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
A quick search showed that this warning simply means no default parser was specified, so BeautifulSoup falls back to the best one it can find on the current system. Specify the parser as the warning suggests and it goes away: change the line
bsObj = BeautifulSoup(html)
to:
bsObj = BeautifulSoup(html, "lxml")
and that's it.
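To convince yourself the fix works without hitting Wikipedia, you can parse a small in-memory snippet and capture warnings. This sketch uses the built-in "html.parser" backend so it runs even where lxml is not installed; substitute "lxml" if you have it:

```python
import warnings
from bs4 import BeautifulSoup

markup = '<div id="bodyContent"><a href="/wiki/Kevin_Bacon">Kevin Bacon</a></div>'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Parser named explicitly -> no "No parser was explicitly specified" warning.
    soup = BeautifulSoup(markup, "html.parser")

print(len(caught))                 # expect 0 captured warnings
print(soup.find("a")["href"])      # -> /wiki/Kevin_Bacon
```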
Setting the parser explicitly: how to fix the warning
I have a suspicion that this is related to the parser that BS will use to read the HTML. Their documentation is here, but if you're like me (on OS X) you might be stuck with something that requires a bit of work:
You'll notice that in the BS4 documentation page above, they point out that by default BS4 will use the Python built-in HTML parser. Assuming you are on OS X, the Apple-bundled version of Python is 2.7.2, which is not lenient with character formatting. I hit this same problem, so I upgraded my version of Python to work around it. Doing this in a virtualenv will minimize disruption to other projects.
If doing that sounds like a pain, you can switch over to the LXML parser:
pip install lxml
And then try:
soup = BeautifulSoup(html, "lxml")
Depending on your scenario, that might be good enough. I found this annoying enough to warrant upgrading my version of Python. Using virtualenv, you can migrate your packages fairly easily.
For a basic, out-of-the-box Python with bs4 installed, you can process your XML with
soup = BeautifulSoup(html, "html5lib")
If, however, you want to use formatter='xml', then you need to
pip3 install lxml
soup = BeautifulSoup(html, features="xml")
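As a concrete illustration of the XML route (this assumes lxml is installed, since bs4's "xml" feature is backed by it): unlike the HTML parsers, the XML parser preserves case-sensitive tag names, which matters for arbitrary XML documents. The sample markup below is made up for illustration:

```python
from bs4 import BeautifulSoup

xml = "<Catalog><Item id='1'>Widget</Item></Catalog>"
soup = BeautifulSoup(xml, features="xml")

# HTML parsers would lowercase these tag names; the XML parser keeps them.
print(soup.find("Item").get_text())
print(soup.Catalog.Item["id"])
```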