2019 far exceeded my expectations in terms of Python adoption within the SEO community.
As we start a new year and I hear more SEO professionals wanting to join in the fun, but frustrated by the initial learning curve, I decided to write this introductory piece with the goal of getting more people involved and contributing.
Most SEO work involves working with spreadsheets, which you have to redo manually when working with multiple brands or repeating the same analysis over time.
When you implement the same workflow in Python, you can trivially reproduce the work or even automate the whole workflow.
We are going to learn Python basics while studying code John Mueller recently shared on Twitter that populates Google Sheets. We will modify his code to add a simple visualization.
Using the Wikipedia API, I pulled in some extra fields for the curious :). Spreadsheet: https://t.co/OSBENEubgt – Colab code: https://t.co/sTAb1vk8N4
— 🍌 John 🍌 (@JohnMu) January 3, 2020
Setting up the Python Environment
Similar to working with Excel or Google Sheets, you have two primary options when working with Python.
You can install and run Python on your local computer, or you can run it in the cloud using Google Colab or Jupyter notebooks.
Let’s review both.
Working with Python on Your Local Computer
I typically choose to work on my Mac when there is software that won’t run in the cloud, for example, when I need to automate a web browser.
You need to download three software packages:
- Anaconda.
- Visual Studio Code.
- The Python bindings for Code.
Go to https://www.anaconda.com/distribution/ to download and install Python 3.7 for your operating system. Anaconda includes Python and most of the libraries that you need for data analysis.
This will take a while to complete.
Once done, search for the Anaconda Navigator and launch it.
Click to launch JupyterLab and it should open a new tab in your browser with a JupyterLab session.
Click on the big icon to start a Python 3 notebook and you are ready to start typing or copy/pasting code snippets.
You can think of this notebook as similar to a new Excel sheet.
The next step is optional.
Go to https://code.visualstudio.com/download and download and install Visual Studio Code for your computer.
It is easier to prototype in Jupyter notebooks and, when you get everything to work, you can use Visual Studio Code to put everything together in a script or app that others can use from the command line.
Make sure to install the Python extension for VSC. You can find it here.
Visual Studio Code has built-in support for Jupyter Notebooks.
You can create one by typing the key combination Command+Shift+P and selecting the option “Python Jupyter Notebook”.
Working with Python in the Cloud
I do most of my Python work on Google Colab notebooks so this is my preferred option.
Go to https://colab.research.google.com/ and you can skip the downloading and installation steps.
Click on the option to start a new Python 3 notebook and you will have the equivalent of a new Google Sheet.
Learning the basics of Python & Pandas
Mueller shared a Colab notebook that pulls data from Wikipedia and populates a Google Sheet with that data.
Professional programmers need to learn the ins and outs of a programming language and that can take a lot of time and effort.
For SEO practitioners, I think a simpler approach that involves studying and adapting existing code could work better. Please share your feedback if you try this and see if I’m right.
We are going to cover most of the same basics you learn in typical Python programming tutorials, but with a practical context in mind.
Let’s start by saving Mueller’s notebook to your Google Drive.
After you click the link, select File > Save a copy in Drive.
Here is the example Google Sheet with the output of the notebook.
Mueller wants to get topic ideas that perform better on mobile compared to desktop.
What kind of content is more useful on mobile vs on desktop? Wikipedia to the rescue! Apparently, celebrity / entertainment & medical content rules mobile. https://t.co/lvEdYmNPB2 … also, Pomeranians?
— 🍌 John 🍌 (@JohnMu) December 30, 2019
He found that celebrity, entertainment, and medical content does best on mobile.
Let’s read through the code and comments to get a high-level overview of how he figured this out.
We have several pieces to the puzzle.
- An empty Google Sheet with 6 prefilled columns and 7 columns that need to be filled in
- The empty Google Sheet includes a pivot table in a separate tab that shows mobile views represent 70.59% of all views in Wikipedia
- The notebook code populates the 7 missing columns mostly in pairs by calling a helper function called update_spreadsheet_rows.
- The helper function receives the names of the columns to update and a function to call that will return the values for the columns.
- After all the columns are populated, we get a final Google Sheet that includes an updated pivot table with a breakdown of the topic.
Python Building Blocks
Let’s learn some common Python building blocks while we review how Mueller’s code retrieves values to populate a couple of fields: the PageId and Description.
# Get the Wikipedia page ID -- needed for a bunch of things. Uses "Article" column
def get_PageId(title):
# Get page description from Wikipedia
def get_description(pageId):
We have two Python functions to retrieve the fields. Python functions are like functions in Google Sheets but you define their behavior in any way you want. They take input, process it and return an output.
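As a minimal sketch of the input, process, output idea, here is a tiny function of our own. The name make_article_url and its logic are hypothetical and are not part of Mueller’s notebook.

```python
# Hypothetical helper, for illustration only: build a Wikipedia article URL
# from an article title.
def make_article_url(title):
    # Wikipedia article URLs use underscores in place of spaces
    slug = title.replace(" ", "_")
    return "https://en.wikipedia.org/wiki/" + slug


url = make_article_url("Avengers: Endgame")  # input goes in, output comes back
print(url)
```

You define the function once with def and then call it as many times as you need, just like reusing a spreadsheet formula across cells.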
Here is the PageId we get when we call get_PageId(“Avengers: Endgame”)
Here is the Description we get when we call get_description(pageId)
'2019 superhero movie produced by Marvel Studios'
Anything after the # symbol is considered a Python comment and is ignored. You use comments to document the intention of the code.
Let’s step through, line by line, the get_PageId function to learn how it gets the ID of the article whose title we are passing in.
# call the Wikipedia API to get the PageId of the article with the given title.
q =
q is a Python dictionary. It holds key-value pairs. If you look up the value of “action”, you get “query” and so on. For example, you would perform such a lookup using q[“action”].
“action” is a Python string. It represents textual information.
“titles”: title maps the “titles” key to the Python variable title that we passed as input to the function. All keys and values are hardcoded and explicit, except for the last one. This is what the dictionary looks like after we execute this function.
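The dictionary itself is not reproduced above, so here is a sketch of what it plausibly looks like. The keys and values are an assumption, inferred from the query string that appears later in this article (action=query&format=json&prop=info&titles=...).

```python
title = "Avengers: Endgame"  # the value passed in as input to the function

# Assumed contents, inferred from the query string shown later in the article
q = {
    "action": "query",
    "format": "json",
    "prop": "info",
    "titles": title,  # the only value that is not hardcoded
}

print(q["action"])  # look up the value stored under the "action" key
print(q["titles"])  # the lookup returns whatever title held at creation time
```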
In the next line we have:
url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(q)
Here we have a Python module function, urllib.parse.urlencode. Module functions are just like Google Sheets functions that provide standard functionality.
Before we call module or library functions, we need to import the module that contains them.
This line at the top of the notebook does that.
Let’s clarify the call and see the output we get.
You can find detailed documentation on the urlencode module function here. Its job is to convert a dictionary of URL parameters into a query string. A query string is the part of the URL after the question mark.
This is the output we get when we run it.
This is what our URL definition line looks like after we add the result of urlencode.
url = "https://en.wikipedia.org/w/api.php?" + "action=query&format=json&prop=info&titles=Avengers%3A+Endgame"
The + sign here concatenates the strings to form one.
url = "https://en.wikipedia.org/w/api.php?action=query&format=json&prop=info&titles=Avengers%3A+Endgame"
This resulting string is the API request the notebook sends to Wikipedia.
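To try the URL-building step on its own, here is a small self-contained sketch. The dictionary contents are an assumption reconstructed from the query string above; urllib.parse.urlencode is part of the Python standard library.

```python
import urllib.parse

# Assumed parameters, matching the query string shown above
q = {
    "action": "query",
    "format": "json",
    "prop": "info",
    "titles": "Avengers: Endgame",
}

query_string = urllib.parse.urlencode(q)  # dictionary -> query string
url = "https://en.wikipedia.org/w/api.php?" + query_string  # + joins the strings

print(query_string)
print(url)
```

Notice how urlencode escapes the colon as %3A and the space as +, so the title is safe to use inside a URL.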
In the next line of code, we open the dynamically generated URL.
response = requests.get(url)
requests.get is a Python third-party module function. You need to install third-party libraries using the Python tool pip.
!pip install --upgrade -q requests
You can run command line scripts and tools from a notebook by prepending them with !
The code after ! is not Python code. It is Unix shell code. This article provides a comprehensive list of the most common shell commands.
After you install the third-party module, you need to import it like you do with standard libraries.
Here is what the translated call looks like.
response = requests.get("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=info&titles=Avengers%3A+Endgame")
You can open this request in the browser and see the API response from Wikipedia. The function call allows us to do this without manually opening a web browser.
The result from the requests.get call gets stored in the Python variable response.
This is what the result looks like.
You can think of this complex data structure as a dictionary where some values include other dictionaries and so on.
The next line of code slices and dices this data structure to extract the PageId.
result = list(response.json()["query"]["pages"].keys())
Let’s step through it to see how it gets it.
When we look up the value for the key “query”, we get a smaller dictionary.
Then, we look up the value of “pages” in this smaller dictionary.
We get an even smaller one. We are drilling down into the big response data structure.
The PageId is available in two places in this slice of the data structure: as the only key, or as a value in the nested dictionary.
John made the smartest choice, which is to use the key to avoid further exploration.
The response from this call is a Python dictionary view of the keys. You can learn more about dictionary views in this article.
We have what we are looking for, but not in the right format.
In the next step, we convert the dictionary view into a Python list.
This is what the conversion looks like.
Python lists are like rows in a Google Sheet. They generally contain multiple values separated by commas, but in this case, there is only one.
Finally, we extract the only element that we care about from the list: the first one.
The first element in Python lists starts at index 0.
Here is the final result.
As this is an identifier, it is better to keep it as a string, but if we needed a number to perform arithmetic operations, we would do another transformation.
In this case, we get a Python integer.
The main differences between strings and integers are the types of operations that you can perform with them. As you saw before, we can use the + operator to concatenate two strings, but if we used the same operator on two numbers, it would add them together.
"44254295" + "3" = "442542953"
44254295 + 3 = 44254298
As a side note, I should mention jq, a cool command line tool that allows you to slice and dice JSON responses directly from curl calls (another awesome command line tool). curl allows you to do the equivalent of what we are doing with the requests module here, but with limitations.
So far we’ve learned how to create functions and use data types that allow us to extract and filter data from third-party sites (Wikipedia in our case).
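To recap the drilling-down steps for the PageId without calling the API, here is a sketch that uses a hand-written dictionary in place of response.json(). The sample structure is a trimmed, hypothetical stand-in for the real response, and the ID is the one used in the arithmetic example above.

```python
# Hypothetical, trimmed-down stand-in for response.json();
# the real response contains more fields.
data = {
    "query": {
        "pages": {
            "44254295": {"pageid": 44254295, "title": "Avengers: Endgame"}
        }
    }
}

pages = data["query"]["pages"]  # drill down two levels
keys_view = pages.keys()        # dictionary view of the keys
keys_list = list(keys_view)     # convert the view into a list
page_id = keys_list[0]          # the first element lives at index 0
page_id_number = int(page_id)   # string -> integer, only needed for arithmetic

print(page_id)
print(page_id + "3")       # strings: + concatenates
print(page_id_number + 3)  # integers: + adds
```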
Let’s call the next function in John’s notebook to learn another important building block: flow control structures.
This is what the API URL looks like. You can try it in the browser.
Here is what the response looks like.
This is the code that we will step through to understand control flow in Python.
# some pages don't have descriptions, so we can't blindly grab the value
if "terms" in rs and "description" in rs["terms"]:
    result = rs["terms"]["description"]
else:
    result = ""
return result
This part checks if the response structure (above) includes a key named “terms”. It uses the Python if … else control flow operator. Control flow operators are the algorithmic building blocks of programs in most languages, including Python.
if "terms" in rs
If this check is successful, we look up the value of that key with rs[“terms”]
We expect the result to be another dictionary and check it to see if there is a key named “description”.
"description" in rs["terms"]
If both checks are successful, then we extract and store the description value.
result = rs["terms"]["description"]
We expect the final value to be a Python list, and we only want the first element as we did before.
The Python logical operator and combines both checks into one, where both need to be true for the whole condition to be true.
If the check is false, the description is an empty string.
result = ""
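Here is a self-contained sketch of the same if … else logic that you can run without calling Wikipedia. The function name extract_description and the two sample dictionaries are hypothetical stand-ins for the parsed API response, and the [0] step reflects the "first element of the list" idea described above.

```python
def extract_description(rs):
    # Some pages have no description, so check both keys before reading
    if "terms" in rs and "description" in rs["terms"]:
        result = rs["terms"]["description"][0]  # first element of the list
    else:
        result = ""  # fall back to an empty string
    return result


# Hypothetical sample inputs standing in for parsed API responses
with_description = {
    "terms": {"description": ["2019 superhero movie produced by Marvel Studios"]}
}
without_description = {"title": "A page with no description"}

print(extract_description(with_description))
print(repr(extract_description(without_description)))
```

Because and stops at the first failed check, Python never tries to read rs["terms"] when the “terms” key is missing, which avoids an error.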
Populating Google Sheets from Python
With a solid understanding of Python’s basic building blocks, we can now tackle the most exciting part of Mueller’s notebook: automatically populating Google Sheets with the values we are pulling from Wikipedia.
# helper function to update all rows in the spreadsheet with a function
def update_spreadsheet_rows(fieldName, parameterName, functionToCall, forceUpdate=False):
    # Go through spreadsheet, update column 'fieldName' with the data calculated
    # by 'functionToCall(parameterName)'. Show a progressbar while doing so.
    # Only calculate / update rows without values there, unless forceUpdate=True.
Let’s step through some interesting parts of this function.
The functionality to update Google Sheets is covered by a third-party module.
We need to install it and import it before we can use it.
!pip install --upgrade -q gspread
import gspread
Mueller chose to convert the sheet into a pandas data frame and, although he mentions in the comments that it was not necessary, we can take the opportunity to learn a little bit of pandas too.
update_spreadsheet_rows("PageId", "Article", get_PageId)
At the end of every helper function that fills a column, we have a call like the one above.
We are passing the relevant columns and the function that will get the corresponding values.
When you pass the name of a function without parentheses or parameters in Python, you are not passing data but code for the receiving function to execute. This is not something that, as far as I know, you can do in a spreadsheet.
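A minimal sketch of passing a function as an argument follows. Both apply_to_all and shout are hypothetical names made up for this illustration; they are not part of the notebook.

```python
def apply_to_all(values, function_to_call):
    # function_to_call is a function object; we call it once per value
    return [function_to_call(value) for value in values]


def shout(text):
    return text.upper() + "!"


# Note: we pass shout, not shout(). No parentheses means "here is the code",
# while parentheses would mean "run it now and pass the result".
results = apply_to_all(["seo", "python"], shout)
print(results)
```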
columnNr = df.columns.get_loc(fieldName) + 1 # column number of output field
The first thing we want to know is which column we need to update. When we run the code above we get 7, which is the column position of the PageId in the sheet (starting with 1).
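To see where the 7 comes from, here is a sketch with an empty data frame. The column list is an assumption based on the order of the fields in the example row printed later in this article; it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical miniature of the sheet: only the column names, no rows
df = pd.DataFrame(
    columns=[
        "Article", "URL", "Views", "PartMobile",
        "ViewsMobile", "ViewsDesktop", "PageId", "Description",
    ]
)

zero_based = df.columns.get_loc("PageId")  # pandas counts positions from 0
columnNr = zero_based + 1                  # Google Sheets columns count from 1

print(zero_based, columnNr)
```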
for index, row in df.iterrows():
In this line of code, we have another control flow operator, the Python for loop. For loops allow you to iterate over elements that represent collections, for example, lists and dictionaries.
In our case above, iterating works much like iterating over a dictionary: the index variable holds the key (the row label), and the row variable holds the value (the data in that row).
To be more precise, df.iterrows() hands back the rows one at a time as (index, row) pairs, where each row is a pandas Series, which is well suited for iteration.
<generator object DataFrame.iterrows at 0x7faddb99f728>
When you print the result of iterrows, you don’t actually get the values, but a Python generator, which is a kind of iterator object.
Iterators produce data on demand, so they require less memory than loading a whole collection at once.
INDEX: 2
ROW: Article César Alonso de las Heras
URL https://en.wikipedia.org/wiki/César_Alonso_de_...
Views 1,944,569
PartMobile 79.06%
ViewsMobile 1,537,376
ViewsDesktop 407,193
PageId 18247033
Description
WikiInLinks
WikiOutLinks
ExtOutLinks
WikidataId
WikidataInstance
Name: 2, dtype: object
This is an example iteration of the for loop. I printed the index and row values.
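You can reproduce this kind of loop on a tiny data frame of your own. The two-row frame below is hypothetical and only reuses article titles that appear in this article; it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical two-row frame; PageId is still empty, as in the unfilled sheet
df = pd.DataFrame(
    {
        "Article": ["Avengers: Endgame", "César Alonso de las Heras"],
        "PageId": ["", ""],
    }
)

seen = []
for index, row in df.iterrows():
    # index is the row label, row is a pandas Series holding that row's cells
    seen.append((index, row["Article"]))
    print("INDEX:", index, "ARTICLE:", row["Article"])
```

Each pass through the loop gives you one row, and you read a cell from it with the column name, just like a dictionary lookup.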
# if we already did it, don't recalculate unless 'forceUpdate' is set.
if forceUpdate or not row[fieldName]:
    result = functionToCall(row[parameterName])
forceUpdate is a Python boolean value which defaults to False. Booleans can only be true or false.
row[“PageId”] is empty initially, so not row[“PageId”] is true and the next line will execute. The or operator allows the next line to execute for subsequent runs only when the flag forceUpdate is true.
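The condition can be tested in isolation. The helper name should_update below is hypothetical; it only mirrors the forceUpdate or not row[fieldName] check so you can see all three cases.

```python
def should_update(current_value, forceUpdate=False):
    # Mirrors the condition `forceUpdate or not row[fieldName]`:
    # update when forced, or when the cell is still empty
    return forceUpdate or not current_value


first_run = should_update("")                       # empty cell
second_run = should_update("44254295")              # already filled in
forced_run = should_update("44254295", forceUpdate=True)  # filled, but forced

print(first_run, second_run, forced_run)
```

An empty string counts as false in Python, which is why not "" is true and the first run goes ahead.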
result = get_PageId(row["Article"])
This is the code that calls our custom function to get the page IDs.
The result value for the example iteration is 39728003
When you review the function carefully, you will notice that we use df, which is not defined in the function. The code that defines it is at the beginning of the notebook.
# Convert to a DataFrame and render.
# (A DataFrame is overkill, but I wanted to play with them more :))
import pandas as pd
df = pd.DataFrame.from_records(worksheetRows)
The code uses the third-party module pandas to create a data frame from the Google Sheet rows. I recommend reading this 10 minutes to pandas article to get familiar. It is a very powerful data manipulation library.
Finally, let’s see how we update the Google Sheet.
row[fieldName] = result # save locally
worksheet.update_cell(index+1, columnNr, result) # update sheet too
This code can be translated to:
row["PageId"] = 39728003 # save locally
worksheet.update_cell(3+1, 7, 39728003) # update sheet too
This is the code that updates the Google Sheet. The variable worksheet is also not defined in the update_spreadsheet_rows function, but you can find it at the beginning of the notebook.
# Authenticate (copy & paste key as detailed), and read spreadsheet
# (This is always confusing, but it works)
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
# get all data from the spreadsheet
worksheet = gc.open(spreadsheetName).sheet1
worksheetRows = worksheet.get_all_values()
I left this code for last because it is more complicated than the previous code. However, it is the first thing you need to execute in the notebook.
First, we import the third-party module gspread, and complete an OAuth authentication in Chrome to get access to Google Sheets.
# get all data from the spreadsheet
worksheet = gc.open("Wikipedia-Views-2019").sheet1
worksheetRows = worksheet.get_all_values()
We manipulate the Google Sheet with the worksheet variable and we use the worksheetRows variable to create the pandas DataFrame.
Visualizing from Python
Now we get to your homework.
I wrote code to partially reproduce John’s pivot table and plot a simple bar chart.
Your job is to add this code to your copy of the notebook and add print(variable_name) statements to understand what I am doing. This is how I analyzed John’s code.
Here is the code.
#Visualize from Python
df.groupby("WikidataInstance").agg()
# the aggregation doesn't work because the numbers include commas
# This gives an error ValueError: Unable to parse string "1,038,950,248" at position 0
#pd.to_numeric(df["ViewsMobile"])
# StackOverflow is your friend :)
#https://stackoverflow.com/questions/22137723/convert-number-strings-with-commas-in-pandas-dataframe-to-float
import locale
from locale import atoi
locale.setlocale(locale.LC_NUMERIC, "")
#df[["ViewsMobile", "ViewsDesktop"]].applymap(atoi)
df["ViewsMobile"] = df["ViewsMobile"].apply(atoi)
df["ViewsDesktop"] = df["ViewsDesktop"].apply(atoi)
# We try again and it works
totals_df = df.groupby("WikidataInstance").agg()
totals_df
#Here we plot
totals_df.head(20).plot(kind="bar")
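If you want to experiment before touching the real sheet, here is a small sketch of the same idea on made-up data. It swaps in a different technique for the comma problem (removing the commas with a string replace instead of relying on the locale setting), and it assumes the aggregation is a sum of views per group, since the aggregation arguments are not shown above. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the numbers are made up for illustration
df = pd.DataFrame(
    {
        "WikidataInstance": ["film", "film", "human"],
        "ViewsMobile": ["1,000", "2,500", "1,537,376"],
        "ViewsDesktop": ["400", "600", "407,193"],
    }
)

# Strip the thousands separators, then convert the strings to integers
for column in ["ViewsMobile", "ViewsDesktop"]:
    df[column] = df[column].str.replace(",", "", regex=False).astype(int)

# Assumed aggregation: add up the views within each group
totals_df = df.groupby("WikidataInstance")[["ViewsMobile", "ViewsDesktop"]].sum()
print(totals_df)
```

From here, totals_df.plot(kind="bar") would draw the chart in a notebook, provided matplotlib is available.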
If you got this far and want to learn more, I recommend you follow the links I included in the article and practice the code snippets in this guide.
At the end of most of my columns, I share interesting Python projects from the SEO community. Please consider checking out the ones that interest you and consider studying them as we did here.
But, even better, see how you might be able to add something simple but valuable that you can share back!
Screenshot taken by author, January 2020