Standardizing Publisher Names
A single publisher's name is often written in a variety of ways, making it hard to review aggregate data on purchasing or circulation. For instance, HarperCollins, HarperCollins Publishers, HarperCollins Publishers Inc. The attached spreadsheet can help you clean up the data so that publishers' names are written in a standardized format.
The instructions below assume that you have a spreadsheet of some sort that contains a column with publisher names in it. This column should not contain the entire 260 field (i.e. City: Publisher, Year), just the publisher name.
Two Ways to Use This Tool
Using Relational Databases
Software that allows relational databases includes Access, PowerBI, and Tableau. What this means is you will import multiple tables from Excel into the software and then create a relationship between the two tables. You can then create a query that draws some information from the main table (e.g., circulation, the cost of a book) and some from the Publisher Standardization table (the standardized name of the publisher).
- Open the software you want to use a create a new database or report.
- Import the data that you want to analyze and the Publisher Standardization Tool into the software.
- Build the relationship between the main table and the Publisher Standardization Tool (which will appear as a table named "Publishers").
- In Access, click the Database Tools tab at the top, then Relationships. Select the two tables, then use the mouse to draw a relationship between whatever field in the main table contains publisher information and the field called "Publisher - Raw" in the Publishers table.
- In PowerBI, click Manage Relationships in the top toolbar. Click New. In the first dropdown menu, select the main table. A preview of the table will appear below the dropdown. Highlight the field in the main table that contains publisher information by clicking on it. In the second dropdown menu, select the table called Publishers, and highlight the field called "Publisher - Raw". Make sure the box is checked next to "Make this relationship active." Click OK at the bottom of the window. Close the Manage Relationships window.
- In Tableau, drag the two tables to the middle of the screen where it says "Drag sheets here." It will automatically connect the two tables and will prompt you to select which kind of join you want to use. Select Left Join if your main data table is on the left, Right Join is the main table is on the right. In the dropdown under Data Source, select the field in the main table that contains publisher information. In the dropdown under Publishers, select "Publisher - Raw."
- Proceed to create your query, report, or visualization. When you want to do any analysis by publisher, use the field from the Publishers table called "Publisher/Imprint - Standardized." Do not create queries with the original publishers data from the main table or the field from the Publishers table called "Publisher - Raw."
- If you have publisher names in your data that do not match anything in the Publisher Standardization Tool, please add to the tool using the instructions below.
Using Excel
In Excel, you can create an added column that contains the standardized data.
- Open the spreadsheet that contains the data you want to analyze.
- Save the Publisher Standardization Tool on your computer. Keep it open.
- Find the column that has publisher names in it. For the purpose of instruction, imagine the publisher name is in column B.
- Create a new column anywhere to the right of column B. Name it "Publisher Standardized."
- In the row under the column header (row 2), begin typing the formula: =VLOOKUP(B2,
- after typing the comma, click over to the Publisher Standardization Tool. Hover the mouse over the top of column A and hold down the mouse to highlight columns A and B. Release the mouse but do not click anywhere else.
- continue typing: ,2,FALSE)
- Hit enter. Your formula will look like this: =VLOOKUP(B2,'[Publisher Standardization Tool.xlsx]Publishers'!$A:$B,2,FALSE)
- If the raw publisher data in cell B2 matches a value in the "Publisher - Raw" column, the formula will return the corresponding value from the "Publisher/Imprint - Standardized" column. If it does not match, the formula will return #N/A.
- To fill in the entire column with the formula, click on the cell where you typed the formula, and move the mouse to the bottom right corner of the cell. The cursor will turn to a + sign. Double-click. The column should automatically populate with the formula.
- You may want to sort the Publisher Standardized column so you can easily find the values that did not match.
- If you have publisher names in your data that do not match anything in the Publisher Standardization Tool, please add to the tool using the instructions below.
Adding to the Tool
If you have additional iterations of publisher names, feel free to add them to the local version of the tool that you have saved. Feel free to email me your additions so I can update the public version.
- Please do not add rows that contain the full 260 field or other information such as the copyright year. Other than that, any kind of formatting is fine, since cleaning up inconsistent formatting is the purpose of the document.
- You should have a list of publishers that did not match anything in the Publisher Standardization Tool. Scroll to the bottom of the spreadsheet and paste your list at the bottom of column A.
- In Excel, click the Data at top of the toolbar, then click Remove Duplicates.
- Make sure the box next to "Publisher - Raw" is checked and the box next to "Publisher/Imprint - Standardized" is unchecked.
- Click OK.
- Go back to the point in column A where you added your new values and fill in the corresponding, standardized values in column B.
- Make sure you are not introducing new variations on a publisher's name in column B! For instance, Human Kinetics is listed as a value in column B, so you would not want to add Human Kinetics Publishers in this column, as this would defeat the goal of standardization. I recommend keeping two copies of the spreadsheet open - one where you are making your additions, and one where you check which standardized names are already in the document.
- Send me your changes! It will be helpful if you highlight them with color so I can tell what is new.