soup that is beautiful

I do take a long time between posts, don’t I? theanswers42.com is kind of a thing I seem to end up taking a hiatus from, and then I remember that I’ve actually got opinions about things and wonder where I can put them, and hey presto, here I am again! I always come flying back saying that “this time it’s different!” and no more hiatuses, but until now I’d never figured it out.

I work as a web developer for a living, and the main language that I use is Python. Python’s awesome, as a) I don’t have to recompile every time I want to see an update to the application I’m working on (I also enjoy doing stuff with Java and Spring, and that can be frustrating sometimes), and b) it has an awesome number of libraries that can do a lot of things in very little code. These are obvious things to discuss (Python 101!) and I’d imagine that there are trillions of blogs about the awesomeness that is Python, but why shouldn’t I add to the pile?

One library that I’ve used a few times, and wish I could use more, is BeautifulSoup. BeautifulSoup is pretty much Python’s awesomeness in a nutshell: it’s powerful, but also very easy to use. If you’ve ever wanted to scrape details from a webpage, or just rummage through a particular site to find something, BeautifulSoup has got your back. And wouldn’t you know it, a little while back I had a play around with it to see what I could come up with and enjoyed myself… and since I felt it was time to revitalize my blog, it seemed a great subject for a post!

So I’m going to write something pretty simple here, but something I could extend if I ever saw the need for it: a very basic webpage reporter, that nevertheless will show some of the things I love about BeautifulSoup and show how you can get something meaningful from a small amount of code! Also, I wanted to test out the code embedding features on WordPress…

To start with, we need the three libraries we’re going to use: Requests (yes, that’ll get some love), the aforementioned BeautifulSoup, and also the built-in collections module. From small beginnings come great (or at least interesting) things.


import requests
from bs4 import BeautifulSoup
from collections import Counter

Yeah, so that is both simple, and obvious, but on the other hand I’m getting to use the code tag for legitimate reasons, so we’re cool. Also I need the warm up!

Next come possibly the four most important lines here. They don’t look like much, but then that’s one of the awesome things about BeautifulSoup: four lines gets you a hell of a lot of information to do stuff with.


# Get the URL
myurl = input("Please enter a URL: ")
# Get page headers for size info
header_info = requests.head(myurl).headers
# Get content for everything else
content = requests.get(myurl).text
# Make a beautiful webpage soup!
soup = BeautifulSoup(content, 'lxml')

And -bam!- in four lines, you have everything you need for awesomeness. The URL from input (yes, I should be strict and probably put a http:// or https:// in there to stop schema bullshit errors, but I’m just playing here), the site headers and text from Requests, and then we make a “soup” out of the site content using BeautifulSoup. And, believe me (unlike the statements of another prominent user of that phrase, you can believe mine!), we’ve already done a lot of work in just four lines!
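On that missing http:// point, here is a minimal sketch of how the input could be tidied up before it reaches Requests. The helper name ensure_scheme and the choice to default to https are assumptions of mine for illustration; the script above doesn't do this.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default="https"):
    """Return url unchanged if it already starts with http(s), else prepend a scheme."""
    url = url.strip()
    # urlparse("example.com") leaves .scheme empty and puts everything in .path
    if urlparse(url).scheme in ("http", "https"):
        return url
    return f"{default}://{url}"


print(ensure_scheme("theanswers42.com"))
print(ensure_scheme("http://theanswers42.com"))
```

The result could then be passed to requests.head() and requests.get() in place of the raw input.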

As the comment indicates, the size is a fairly trivial thing, but the text? There’s a lot of interesting stuff going on there, and with only a few lines we’re all set up to do some interesting things. I’m going to find some things out about my little website here. Some of them useful, others not so much but still…

First things first, let’s see what meta tags I have. I think having more people visit my site to read my ramblings is only a good thing, so let’s chuck these in the shopping cart!


# Our list of meta tags
meta_tags = []
# Find and gather all the tags!
for i in soup.find_all('meta'):
    print(i.get('content'))
    meta_tags.append(i.get('content'))

And that’s our first bit of scraping awesomeness! We’ve fetched the content attribute of every meta tag in the site’s metadata, and chucked them in our shopping cart of info for the end report…
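One wrinkle: get('content') returns None for any meta tag that has no content attribute (a charset declaration, say), so the list can end up with empty entries. Below is a small sketch of one way to keep each value next to its name and skip the empty ones. The HTML snippet is made up for illustration, the dictionary name named_meta is mine, and I'm using the built-in html.parser here so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML, purely for illustration
html = """
<html><head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:type" content="website">
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

named_meta = {}
for tag in soup.find_all("meta"):
    content = tag.get("content")
    if content is None:
        continue  # e.g. <meta charset=...> has nothing to report
    # Open Graph tags use "property" rather than "name"
    key = tag.get("name") or tag.get("property") or "(unnamed)"
    named_meta[key] = content

print(named_meta)
```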

Next, I want to know how big my site is. Why? Just for the hell of it. And also I want Requests to get its moment in the sun.


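# Content-Length can be missing (e.g. chunked responses); fall back to the size of the downloaded text
content_length = header_info.get('Content-Length') or len(content.encode('utf-8'))
size_in_kb = float(content_length) * 0.001
print('Page size: ' + str(size_in_kb) + 'kb')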

We now know the page’s size in kilobytes! Now let’s do something more interesting…


# kill all script and style elements
for script in soup(["script", "style", '[document]','head','title']):
  script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
website_text = '\n'.join(chunk for chunk in chunks if chunk)

So, while HTML is awesome, for our purposes it’s not what we want, so we’re going to rip out all the script and style elements and then get down to learning some things about our site! BeautifulSoup makes it pretty easy to get rid of all that…
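For what it's worth, a more compact route to much the same word list is sketched below: decompose() throws the unwanted tags away, and get_text() can insert a separator and strip whitespace itself. The sample HTML is invented, and this is an alternative I'm assuming is close enough for word counting, not the code used for the report.

```python
from bs4 import BeautifulSoup

# Invented sample page, purely for illustration
html = """
<html><head><title>Ignored title</title><style>p { color: red; }</style></head>
<body><p>Hello   soup</p><script>var x = 1;</script><p>beautiful soup</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove tags whose contents are not visible page text
for tag in soup(["script", "style", "title"]):
    tag.decompose()

# One space between text fragments, whitespace trimmed from each fragment
text = soup.get_text(separator=" ", strip=True)
words = text.split()
print(words)
```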

As I’m sure you’ll have guessed when reading this, I’m a fairly wordy person, so let’s see how many words are in our page:


page_words = website_text.split()
word_count = Counter(page_words)

# Count words
total_words = 0
unique_words = 0

for key, value in word_count.items():
    total_words += value

    # "unique" here means a word that appears exactly once on the page
    if value == 1:
        unique_words += 1

So, courtesy of Counter from the collections module, we now have a tally of every word in the page and how many times it appears. A word count is always useful, and thanks to our cleaning from earlier we shouldn’t have any HTML messing the count up!
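A note on what "unique" means in the loop above: it counts words that appear exactly once, which is not the same thing as the number of distinct words. Here is a small sketch of both on a made-up sentence (the names distinct_words and words_used_once are mine):

```python
from collections import Counter

sample_words = "the soup is the best soup in the world".split()
word_count = Counter(sample_words)

total_words = sum(word_count.values())  # every word, repeats included
distinct_words = len(word_count)        # how many different words appear
words_used_once = sum(1 for n in word_count.values() if n == 1)

print(total_words, distinct_words, words_used_once)
```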

I guess I want to find out what the most common things I say are as well…


# Find the five most common words
word_list = word_count.most_common(5)
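As it stands, the counting is case-sensitive and keeps punctuation, so "The", "the" and "the," would all be tallied separately. Here's a rough sketch of one way to normalise words first; the sample sentence is invented, and trimming with string.punctuation is just one possible choice.

```python
import string
from collections import Counter

sample_text = "The soup, the Soup and THE end."

# Lower-case each word and trim surrounding punctuation before counting
cleaned = [w.strip(string.punctuation).lower() for w in sample_text.split()]
cleaned = [w for w in cleaned if w]  # drop anything that was only punctuation

top_two = Counter(cleaned).most_common(2)
print(top_two)
```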

We have the things I say the most (as I’m going to test it on my own site), so next, let’s find out who I link to…


for i in soup.find_all('a'):
    if i.get("href"):
        if i.string is not None:
            print("link to: " + i.get("href") + ", text string: " + i.string)
        else:
            print("link to: " + i.get("href"))

Again, BeautifulSoup is able to pull all the info we need from the page’s HTML. The site’s report is coming together…
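If you wanted to go one step further and see how many of those links stay on the same site, the standard library can do the sorting. This is a sketch under assumptions: the base URL and hrefs are hard-coded samples ("/page/2/" is a hypothetical relative link), and "same site" here simply means the same host name.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://theanswers42.com/"
# Sample hrefs; "/page/2/" is a hypothetical relative link
hrefs = [
    "#content",
    "https://theanswers42.com/tag/marvel/",
    "https://wordpress.com/?ref=footer_blog",
    "/page/2/",
]

internal, external = [], []
for href in hrefs:
    absolute = urljoin(base_url, href)  # resolves "#content" and "/page/2/"
    same_host = urlparse(absolute).netloc == urlparse(base_url).netloc
    (internal if same_host else external).append(absolute)

print(internal)
print(external)
```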

Finally, let’s test if I talk about the things I’ve put in my meta tags. Is the site about what I claim it is to the world?


in_page = []
not_in_page = []

# Get all tags in page, and all tags not in page
for i in meta_tags:
    if i in page_words:
        in_page.append(i)
    else:
        not_in_page.append(i)
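One thing to be aware of with this check: page_words is a list of single words, so a multi-word meta value can never be "in" it, and None entries get compared too. Below is a hedged sketch of a looser variant that skips the empty values and looks for each value as a case-insensitive substring of the page text. The sample data and the function name split_meta_by_presence are invented for illustration.

```python
def split_meta_by_presence(meta_values, page_text):
    """Split meta values into those found in page_text and those not found."""
    haystack = page_text.lower()
    found, missing = [], []
    for value in meta_values:
        if not value:
            continue  # skip None and empty strings
        (found if value.lower() in haystack else missing).append(value)
    return found, missing


sample_meta = [None, "website", "The Answer's 42", "en_US", ""]
sample_text = "Welcome to The Answer's 42, a website about things"
found, missing = split_meta_by_presence(sample_meta, sample_text)
print(found, missing)
```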

If you’ve been following along, we should have enough to find out some basic info about a website in around 66 lines of code, or fewer. So, for our maiden voyage, we might as well test it on my own site…


==================================
Report for: http://theanswers42.com
==================================
Page title: The Answer’s 42 – The answer’s out there, but are we asking the right questions?
==================================
==================================
Meta tags present:
None
width=device-width, initial-scale=1
WordPress.com
website
The Answer's 42
The answer's out there, but are we asking the right questions?
https://theanswers42.com/
The Answer's 42

en_US
@wordpressdotcom
The Answer's 42
width=device-width;height=device-height
The answer's out there, but are we asking the right questions?
name=Subscribe;action-uri=https://theanswers42.com/feed/;icon-uri=https://s2.wp.com/i/favicon.ico
name=Sign up for a free blog;action-uri=http://wordpress.com/signup/;icon-uri=https://s2.wp.com/i/favicon.ico
name=WordPress.com Support;action-uri=http://support.wordpress.com/;icon-uri=https://s2.wp.com/i/favicon.ico
name=WordPress.com Forums;action-uri=http://forums.wordpress.com/;icon-uri=https://s2.wp.com/i/favicon.ico
The Answer's 42 on WordPress.com
The answer's out there, but are we asking the right questions?
==================================
Page size: 0.178kb
==================================
Total words: 612
==================================
Unique words: 211
==================================
==================================
5 most common words: [('the', 23), ('a', 19), ('to', 16), ('new', 15), ('of', 12)]
==================================
Page links:
link to: #content, text string: Skip to content
link to: https://theanswers42.com/, text string: The Answer’s 42
link to: https://theanswers42.com/2016/10/28/strange-delights/, text string: strange delights
link to: https://theanswers42.com/2016/10/28/strange-delights/
link to: https://theanswers42.com/category/comic-book-geekery/, text string: Comic book geekery
link to: https://theanswers42.com/category/movie-magic/, text string: movie magic
link to: https://theanswers42.com/2016/10/28/strange-delights/#respond, text string: Leave a comment
link to: https://theanswers42.com/2016/10/28/strange-delights/
link to: https://theanswers42.com/tag/doctor-strange/, text string: doctor strange
link to: https://theanswers42.com/tag/marvel/, text string: marvel
link to: https://theanswers42.com/tag/review/, text string: review
link to: https://theanswers42.com/2016/10/23/shit-film-sunday-stranger-than-fiction/, text string: shit film sunday: the strange tale of doctor mordrid
link to: https://theanswers42.com/2016/10/23/shit-film-sunday-stranger-than-fiction/
link to: https://theanswers42.com/category/movie-magic/, text string: movie magic
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2016/10/23/shit-film-sunday-stranger-than-fiction/#respond, text string: Leave a comment
link to: https://theanswers42.com/2016/10/23/shit-film-sunday-stranger-than-fiction/
link to: https://theanswers42.com/tag/doctor-mordrid/, text string: doctor mordrid
link to: https://theanswers42.com/tag/doctor-strange/, text string: doctor strange
link to: https://theanswers42.com/tag/marvel/, text string: marvel
link to: https://theanswers42.com/2016/10/03/really-new-year-really-new-leaf-really-new-blog/, text string: really new year, really new leaf, really new blog
link to: https://theanswers42.com/2016/10/03/really-new-year-really-new-leaf-really-new-blog/, text string: October 3, 2016
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2016/10/03/really-new-year-really-new-leaf-really-new-blog/#respond, text string: Leave a comment
link to: https://theanswers42.com/2016/10/03/really-new-year-really-new-leaf-really-new-blog/
link to: https://theanswers42.com/2016/09/29/one-year-later/, text string: one year later…
link to: https://theanswers42.com/2016/09/29/one-year-later/, text string: September 29, 2016
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2016/09/29/one-year-later/#respond, text string: Leave a comment
link to: https://theanswers42.com/2016/09/29/one-year-later/
link to: https://theanswers42.com/2015/09/29/new-year-new-leaf-new-blog/, text string: new year, new leaf… new blog?
link to: https://theanswers42.com/2015/09/29/new-year-new-leaf-new-blog/
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2015/09/29/new-year-new-leaf-new-blog/#respond, text string: Leave a comment
link to: https://theanswers42.com/2015/09/29/new-year-new-leaf-new-blog/
link to: https://theanswers42.com/2014/10/29/visual-studio-meet-sdl2-sdl2-meet-visual-studio/, text string: visual studio meet SDL2, SDL2 meet visual studio…
link to: https://theanswers42.com/2014/10/29/visual-studio-meet-sdl2-sdl2-meet-visual-studio/
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2014/10/29/visual-studio-meet-sdl2-sdl2-meet-visual-studio/#respond, text string: Leave a comment
link to: https://theanswers42.com/2014/10/29/visual-studio-meet-sdl2-sdl2-meet-visual-studio/
link to: https://theanswers42.com/tag/sdl2/, text string: SDL2
link to: https://theanswers42.com/tag/sdl2-configuration/, text string: SDL2 configuration
link to: https://theanswers42.com/tag/visual-studio/, text string: Visual Studio
link to: https://theanswers42.com/2014/07/27/back-to-the-beginning/, text string: back to the beginning…
link to: https://theanswers42.com/2014/07/27/back-to-the-beginning/
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2014/07/27/back-to-the-beginning/#respond, text string: Leave a comment
link to: https://theanswers42.com/2014/07/27/back-to-the-beginning/
link to: https://theanswers42.com/page/2/
link to: https://theanswers42.com/page/5/
link to: https://theanswers42.com/page/2/, text string: Next page
link to: https://wordpress.com/?ref=footer_blog, text string: Blog at WordPress.com.
link to: https://theanswers42.com/, text string: The Answer’s 42
link to: https://theanswers42.com/, text string:
The Answer’s 42
link to: https://wordpress.com/?ref=footer_website, text string: Create a free website or blog at WordPress.com.
link to: #, text string: Cancel
==================================
==================================
Meta tags in page: ['website']
==================================
Meta tags not in page[None, 'width=device-width, initial-scale=1', 'WordPress.com', "The Answer's 42", "The answer's out there, but are we asking the right questions?", 'https://theanswers42.com/', "The Answer's 42", 'https://s0.wp.com/i/blank.jpg', 'en_US', '@wordpressdotcom', "The Answer's 42", 'width=device-width;height=device-height', "The answer's out there, but are we asking the right questions?", 'name=Subscribe;action-uri=https://theanswers42.com/feed/;icon-uri=https://s2.wp.com/i/favicon.ico', 'name=Sign up for a free blog;action-uri=http://wordpress.com/signup/;icon-uri=https://s2.wp.com/i/favicon.ico', 'name=WordPress.com Support;action-uri=http://support.wordpress.com/;icon-uri=https://s2.wp.com/i/favicon.ico', 'name=WordPress.com Forums;action-uri=http://forums.wordpress.com/;icon-uri=https://s2.wp.com/i/favicon.ico', "The Answer's 42 on WordPress.com", "The answer's out there, but are we asking the right questions?"]
==================================

Hmmm, so largely self-referential, my favourite word is “the”, and the meta tags promised it’s a website and it is indeed so. And well, that’s my brief demonstration of BeautifulSoup over. It’s not particularly exhaustive, but it does show the ease with which you can extract (some) meaningful information with very little effort using Python and its awesome libraries. A good first post back!

I could improve it: add an HTML page output and so on… but right now, let’s leave it at that. Should you want the full source, you can find it on my GitHub, in the Odds and Ends repository, along with any other gibberish (or possibly useful) scripts I come up with.

That was cool! I feel the post ideas cropping up in my head already…
