soup that is beautiful

I do take a long time between posts, don’t I? theanswers42.com is kind of a thing I seem to end up taking a hiatus from, and then remember that I’ve actually got opinions about things and wonder where I can put them, and hey presto here I am again! I always come flying back saying that “this time it’s different!” and no more hiatuses, but until now I’d never figured out how to make that stick.

I work as a web developer for a living, and the main language that I use is Python. Python’s awesome, as a) I don’t have to recompile every time I want to see an update to the application I’m working on (I also enjoy doing stuff with Java and Spring, and that can be frustrating sometimes), and b) it has an awesome number of libraries that can do a lot of things in very little code. These are obvious things to discuss (Python 101!) and I’d imagine that there are trillions of blogs about the awesomeness that is Python, but why shouldn’t I add to the pile?

One library that I’ve used a few times, and wish I could use more, is BeautifulSoup. BeautifulSoup is pretty much Python’s awesomeness in a nutshell: it’s powerful, but also very easy to use. If you’ve ever wanted to scrape details from a webpage, or just rummage through a particular site to find something, BeautifulSoup has got your back. And wouldn’t you know it, a little while back I had a play around with it to see what I could come up with and enjoyed myself… and since I also felt it was time to revitalize my blog, it seemed like a great subject for a post!

So I’m going to write something pretty simple here, but something I could extend if I ever saw the need for it: a very basic webpage reporter, that nevertheless will show some of the things I love about BeautifulSoup and show how you can get something meaningful from a small amount of code! Also, I wanted to test out the code embedding features on WordPress…

To start with, we need the three libraries we’re going to use: Requests (yes, that’ll get some love), the aforementioned BeautifulSoup, and also the collections module from the standard library. From small beginnings come great (or at least interesting) things.

import requests
from bs4 import BeautifulSoup
from collections import Counter

Yeah, so that is both simple, and obvious, but on the other hand I’m getting to use the code tag for legitimate reasons, so we’re cool. Also I need the warm up!

Next come possibly the four most important lines here. They don’t look like much, but then that’s one of the awesome things about BeautifulSoup: four lines gets you a hell of a lot of information to do stuff with.

# Get the URL
myurl = input("Please enter a URL: ")
# Get page headers for size info
header_info = requests.head(myurl).headers
# Get content for everything else
content = requests.get(myurl).text
# Make a beautiful webpage soup!
soup = BeautifulSoup(content, 'lxml')

And -bam!- in four lines, you have everything you need for awesomeness. The URL from input (yes, I should be strict and probably put a http:// or https:// in there to stop schema bullshit errors, but I’m just playing here), the site headers and text from Requests, and then we make a “soup” out of the site content using BeautifulSoup. And, believe me (unlike the statements of another prominent user of that phrase, you can believe mine!), we’ve already done a lot of work in just four lines!
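If you did want to be strict about the URL, here is a minimal sketch of one way to do it. Requests raises a MissingSchema error when the URL has no http:// or https:// in front, so this helper adds one when it is missing. The name `ensure_scheme` and the choice of https as the default are my own assumptions for illustration, not part of the original script.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default_scheme="https"):
    """Return url unchanged if it already has a web scheme, else prepend one."""
    url = url.strip()
    if urlparse(url).scheme in ("http", "https"):
        return url
    # No usable scheme found, so assume the default one
    return f"{default_scheme}://{url}"


print(ensure_scheme("theanswers42.com"))
print(ensure_scheme("http://theanswers42.com"))
```

In the script above, the result would be passed to Requests in place of the raw input string.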

As the comment indicates, the size is a fairly trivial thing, but the text? There’s a lot of interesting stuff going on there, and with only a few lines we’re all set up to do some interesting things. I’m going to find some things out about my little website here. Some of them useful, others not so much but still…

First things first, let’s see what meta tags I have. I think having more people visit my site to read my ramblings is only a good thing, so let’s chuck these in the shopping cart!

# Our list of meta tags
meta_tags = []
# Find and gather all the tags!
for i in soup.find_all('meta'):
    print(i.get('content'))
    meta_tags.append(i.get('content'))

And that’s our first bit of scraping awesomeness! We’ve fetched the content attribute of every meta tag in the page (which will be None for any meta tag that doesn’t have one), and chucked them all in our shopping cart of info for the end report…

Next, I want to know how big my site is. Why? Just for the hell of it. And also I want Requests to get its moment in the sun.

# Content-Length is optional, so fall back to 0 if the server omits it
size_in_kb = float(header_info.get('Content-Length', 0)) * 0.001
print('Page size: ' + str(size_in_kb) + 'kb')

We now know the page’s size in kilobytes! Now let’s do something more interesting…
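One caveat worth sketching: servers are not obliged to send a Content-Length header, so a fallback is handy. The helper below is a rough sketch, assuming you have the headers as a mapping and the body as bytes; the name `page_size_kb` is made up for this example. With Requests, the matching inputs would be `response.headers` and `response.content`. Also note that `requests.head` does not follow redirects by default, so for a URL that redirects, the headers describe the redirect response rather than the final page, which may be worth keeping in mind when a size looks suspiciously small.

```python
def page_size_kb(headers, body_bytes):
    """Page size in kilobytes, preferring Content-Length when it is usable."""
    declared = headers.get("Content-Length")
    if declared is not None and declared.isdigit():
        return int(declared) / 1000
    # Header missing or malformed: measure the bytes we actually received
    return len(body_bytes) / 1000


# Plain dicts stand in for real response headers here
print(page_size_kb({"Content-Length": "178"}, b""))
print(page_size_kb({}, b"x" * 2500))
```

A plain dict lookup is case-sensitive, unlike the headers object Requests returns, so the sample keys above use the exact spelling.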

# kill all script and style elements
for script in soup(["script", "style", '[document]','head','title']):
    script.extract()    # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
website_text = '\n'.join(chunk for chunk in chunks if chunk)

So, while HTML is awesome, for our purposes it’s not what we want, so we’re going to rip out all the script and style elements and then get down to learning some things about our site! BeautifulSoup makes it pretty easy to get rid of all that…

As I’m sure you’ll have guessed when reading this, I’m a fairly wordy person, so let’s see how many words are in our page:

page_words = website_text.split()
word_count = Counter(page_words)
# Count words
total_words = 0
unique_words = 0
for key, value in word_count.items():
    total_words += value
    if value == 1:
        unique_words += 1

So, courtesy of Counter from the collections module, we now have every word in the page mapped to the number of times it appears, plus a total word count and a count of the words that show up exactly once. A word count is always useful, and thanks to our cleaning from earlier we shouldn’t have any HTML messing the count up!
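For what it’s worth, Counter can give the same numbers without a loop. This is just a sketch on a made-up sentence, and it separates two things that “unique” could mean: how many different words there are, and how many words appear exactly once (which is what the loop above counts).

```python
from collections import Counter

# A made-up sample standing in for the page text
sample_words = "the soup is the best soup in the world".split()
word_count = Counter(sample_words)

total_words = sum(word_count.values())   # every occurrence
distinct_words = len(word_count)         # different words seen
once_only = sum(1 for n in word_count.values() if n == 1)  # used exactly once

print(total_words, distinct_words, once_only)
```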

I guess I want to find out what the most common things I say are as well…

# Find the five most common words
word_list = word_count.most_common(5)
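One thing the counting doesn’t do is tidy the words up first, so “The” and “the,” would be counted separately. Here’s a small sketch of how that could be handled, assuming lowercasing and trimming punctuation is good enough; the `normalise` helper and the sample sentence are invented for the example.

```python
import string
from collections import Counter


def normalise(word):
    """Lowercase a word and trim punctuation from both ends."""
    return word.lower().strip(string.punctuation + "“”‘’…")


raw_words = "The soup, the SOUP and… the spoon!".split()
# Drop anything that was nothing but punctuation
cleaned = [w for w in (normalise(word) for word in raw_words) if w]
print(Counter(cleaned).most_common(2))
```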

We have the things I say the most (as I’m going to test it on my own site), so next, let’s find out who I link to…

for i in soup.find_all('a'):
    if i.get("href"):
        if(i.string is not None):
            print("link to: " +i.get("href") + ", text string: " + i.string)
        else:
            print("link to: " +i.get("href"))

Again, BeautifulSoup is able to pull all the info we need from the page’s HTML. The site’s report is coming together…
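Some of those href values are relative (things like #content), which isn’t much use in a report on its own. A quick sketch of how they could be turned into full URLs with the standard library’s urljoin, assuming the page address is the base; the /page/2/ entry is a made-up relative link for illustration.

```python
from urllib.parse import urljoin

base_url = "https://theanswers42.com/"
# Sample href values; "/page/2/" is a hypothetical relative link
hrefs = ["#content", "/page/2/", "https://wordpress.com/?ref=footer_blog"]

# Relative links are resolved against the base, absolute ones pass through
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print("link to: " + link)
```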

Finally, let’s test if I talk about the things I’ve put in my meta tags. Is the site about what I claim it is to the world?

in_page = []
not_in_page = []
# Get all tags in page, and all tags not in page
for i in meta_tags:
    if i in page_words:
        in_page.append(i)
    else:
        not_in_page.append(i)
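Worth noting: that check compares each whole meta value against single words, so a value with several words in it can never match. Below is a rough sketch of a more forgiving version, which treats a meta value as present when every one of its words turns up on the page. The function name `split_meta_matches` and the sample data are invented for this example.

```python
def split_meta_matches(meta_values, page_words):
    """Sort meta values by whether all of their words occur on the page."""
    vocabulary = {word.lower() for word in page_words}
    found, missing = [], []
    for value in meta_values:
        if not value:  # skip None and empty strings
            continue
        words = value.lower().split()
        if all(w in vocabulary for w in words):
            found.append(value)
        else:
            missing.append(value)
    return found, missing


found, missing = split_meta_matches(
    [None, "website", "The Answer's 42", "en_US"],
    ["The", "Answer's", "42", "is", "a", "website"],
)
print(found)
print(missing)
```

This ignores word order and punctuation differences, so it is a loose test rather than a proper phrase search.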

If you’ve been following along, we should have enough to find out some basic info about a website in around 66 lines of code, or fewer. So, for our maiden voyage, we might as well test it on my own site…

==================================
Report for: http://theanswers42.com
==================================
Page title: The Answer’s 42 – The answer’s out there, but are we asking the right questions?
==================================
==================================
Meta tags present:
None
width=device-width, initial-scale=1
WordPress.com
website
The Answer's 42
The answer's out there, but are we asking the right questions?
https://theanswers42.com/
The Answer's 42
en_US
@wordpressdotcom
The Answer's 42
width=device-width;height=device-height
The answer's out there, but are we asking the right questions?
name=Subscribe;action-uri=https://theanswers42.com/feed/;icon-uri=https://s2.wp.com/i/favicon.ico
name=Sign up for a free blog;action-uri=http://wordpress.com/signup/;icon-uri=https://s2.wp.com/i/favicon.ico
name=WordPress.com Support;action-uri=http://support.wordpress.com/;icon-uri=https://s2.wp.com/i/favicon.ico
name=WordPress.com Forums;action-uri=http://forums.wordpress.com/;icon-uri=https://s2.wp.com/i/favicon.ico
The Answer's 42 on WordPress.com
The answer's out there, but are we asking the right questions?
==================================
Page size: 0.178kb
==================================
Total words: 612
==================================
Unique words: 211
==================================
==================================
5 most common words:
[('the', 23), ('a', 19), ('to', 16), ('new', 15), ('of', 12)]
==================================
Page links:
link to: #content, text string: Skip to content
link to: https://theanswers42.com/, text string: The Answer’s 42
link to: https://theanswers42.com/2016/10/28/strange-delights/, text string: strange delights
link to: https://theanswers42.com/2016/10/28/strange-delights/
link to: https://theanswers42.com/category/comic-book-geekery/, text string: Comic book geekery
link to: https://theanswers42.com/category/movie-magic/, text string: movie magic
link to: https://theanswers42.com/2016/10/28/strange-delights/#respond, text string: Leave a comment
link to: https://theanswers42.com/2016/10/28/strange-delights/
link to: https://theanswers42.com/tag/doctor-strange/, text string: doctor strange
link to: https://theanswers42.com/tag/marvel/, text string: marvel
link to: https://theanswers42.com/tag/review/, text string: review
link to: https://theanswers42.com/2016/10/23/shit-film-sunday-stranger-than-fiction/, text string: shit film sunday: the strange tale of doctor mordrid
link to: https://theanswers42.com/2016/10/23/shit-film-sunday-stranger-than-fiction/
link to: https://theanswers42.com/category/movie-magic/, text string: movie magic
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2016/10/23/shit-film-sunday-stranger-than-fiction/#respond, text string: Leave a comment
link to: https://theanswers42.com/2016/10/23/shit-film-sunday-stranger-than-fiction/
link to: https://theanswers42.com/tag/doctor-mordrid/, text string: doctor mordrid
link to: https://theanswers42.com/tag/doctor-strange/, text string: doctor strange
link to: https://theanswers42.com/tag/marvel/, text string: marvel
link to: https://theanswers42.com/2016/10/03/really-new-year-really-new-leaf-really-new-blog/, text string: really new year, really new leaf, really new blog
link to: https://theanswers42.com/2016/10/03/really-new-year-really-new-leaf-really-new-blog/, text string: October 3, 2016
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2016/10/03/really-new-year-really-new-leaf-really-new-blog/#respond, text string: Leave a comment
link to: https://theanswers42.com/2016/10/03/really-new-year-really-new-leaf-really-new-blog/
link to: https://theanswers42.com/2016/09/29/one-year-later/, text string: one year later…
link to: https://theanswers42.com/2016/09/29/one-year-later/, text string: September 29, 2016
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2016/09/29/one-year-later/#respond, text string: Leave a comment
link to: https://theanswers42.com/2016/09/29/one-year-later/
link to: https://theanswers42.com/2015/09/29/new-year-new-leaf-new-blog/, text string: new year, new leaf… new blog?
link to: https://theanswers42.com/2015/09/29/new-year-new-leaf-new-blog/
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2015/09/29/new-year-new-leaf-new-blog/#respond, text string: Leave a comment
link to: https://theanswers42.com/2015/09/29/new-year-new-leaf-new-blog/
link to: https://theanswers42.com/2014/10/29/visual-studio-meet-sdl2-sdl2-meet-visual-studio/, text string: visual studio meet SDL2, SDL2 meet visual studio…
link to: https://theanswers42.com/2014/10/29/visual-studio-meet-sdl2-sdl2-meet-visual-studio/
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2014/10/29/visual-studio-meet-sdl2-sdl2-meet-visual-studio/#respond, text string: Leave a comment
link to: https://theanswers42.com/2014/10/29/visual-studio-meet-sdl2-sdl2-meet-visual-studio/
link to: https://theanswers42.com/tag/sdl2/, text string: SDL2
link to: https://theanswers42.com/tag/sdl2-configuration/, text string: SDL2 configuration
link to: https://theanswers42.com/tag/visual-studio/, text string: Visual Studio
link to: https://theanswers42.com/2014/07/27/back-to-the-beginning/, text string: back to the beginning…
link to: https://theanswers42.com/2014/07/27/back-to-the-beginning/
link to: https://theanswers42.com/category/uncategorized/, text string: Uncategorized
link to: https://theanswers42.com/2014/07/27/back-to-the-beginning/#respond, text string: Leave a comment
link to: https://theanswers42.com/2014/07/27/back-to-the-beginning/
link to: https://theanswers42.com/page/2/
link to: https://theanswers42.com/page/5/
link to: https://theanswers42.com/page/2/, text string: Next page
link to: https://wordpress.com/?ref=footer_blog, text string: Blog at WordPress.com.
link to: https://theanswers42.com/, text string: The Answer’s 42
link to: https://theanswers42.com/, text string: The Answer’s 42
link to: https://wordpress.com/?ref=footer_website, text string: Create a free website or blog at WordPress.com.
link to: #, text string: Cancel
==================================
==================================
Meta tags in page: ['website']
==================================
Meta tags not in page[None, 'width=device-width, initial-scale=1', 'WordPress.com', "The Answer's 42", "The answer's out there, but are we asking the right questions?", 'https://theanswers42.com/', "The Answer's 42", 'https://s0.wp.com/i/blank.jpg?m=1383295312i', 'en_US', '@wordpressdotcom', "The Answer's 42", 'width=device-width;height=device-height', "The answer's out there, but are we asking the right questions?", 'name=Subscribe;action-uri=https://theanswers42.com/feed/;icon-uri=https://s2.wp.com/i/favicon.ico', 'name=Sign up for a free blog;action-uri=http://wordpress.com/signup/;icon-uri=https://s2.wp.com/i/favicon.ico', 'name=WordPress.com Support;action-uri=http://support.wordpress.com/;icon-uri=https://s2.wp.com/i/favicon.ico', 'name=WordPress.com Forums;action-uri=http://forums.wordpress.com/;icon-uri=https://s2.wp.com/i/favicon.ico', "The Answer's 42 on WordPress.com", "The answer's out there, but are we asking the right questions?"]
==================================

Hmmm, so it’s largely self-referential, my favourite word is “the”, and the meta tags promised it’s a website and it is indeed so. And well, that’s my brief demonstration of BeautifulSoup over. It’s not particularly exhaustive, but it does show the ease with which you can extract (some) meaningful information with very little effort using Python and its awesome libraries. A good first post back!

I could improve it: add an HTML page output and so on… but right now, let’s leave it at that. Should you want the full source, you can find it on my GitHub, in the Odds and Ends repository, along with any other gibberish (or possibly useful) scripts I come up with.

That was cool! I feel the post ideas cropping up in my head already…

Published by theanswers42
