Building Student Name Index for Yearbooks

Edward, one of our 11th grade students on the school’s yearbook team wanted to build an index of page numbers where each student appeared. Traditionally, this would entail multiple students going through every page and recording the page number for each student appearing on that page. Inevitably, some students get missed, errors are made, and some pages get skipped.

Edward, recognizing an opportunity to leverage his programming skills, wrote a Python script to collect the information in a much more efficient manner. His script reads two columns from a Google Spreadsheet into lists named name and nameraw. Then, for each name in the lists, the script searches each page of the yearbook pdf and logs each page number where the name is located. Finally, this data is written back to the Google sheet.

This script leverages Google sheets, but with minor alteration, it could easily use locally stored files instead. Note that this script depends on a number of dependencies, so don’t forget to add those that you may not yet have installed (PyPDF2, re, pathlib, and the relevant Google auth and client libraries). If you’re unfamiliar with the Google python libraries, well, that will need to be another Google search. However, if you use local files rather than reading from a Google sheet, those libraries (and the code between the dashed lines) could be excluded. You would just need to replace those lines with a couple others that would open your csv file with the list(s) of names.

This is Edward’s script and I had no hand in creating or using it. If you want to use it yourself, you’re welcome to do so. If you get stuck, check with one of your programming students or teachers first. If you’re still stuck, comment below and I’ll help out if I can.

from __future__ import print_function
import os.path      
from PyPDF2 import PdfFileReader, PdfFileWriter
from pathlib import Path
import re
import PyPDF2

#-----------------------------------------------------
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google.oauth2 import service_account

# Note that you will need to set up your own Google API credentials if
# you're using Google Sheets. This part is beyond the scope of this post.
# But you can find how to do this with a quick google search. 
SERVICE_ACCOUNT_FILE = 'keys.json'
SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
creds = None
creds = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES)

# The ID and range of the spreadsheet. Note that the ID below is fake
# (we're not going to post the real one on the interwebs)
# Make sure you update the ID to refer to your spreadsheet
SPREADSHEET_ID = '1DQyjh1-7T9jjU7kmg6-TNcjsCtnCspWiMbv3t2mJeM'

service = build('sheets', 'v4', credentials=creds)

# Call the Sheets API
sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SPREADSHEET_ID,
                            range="Sheet1!B1:B508").execute()
result2 = sheet.values().get(spreadsheetId=SPREADSHEET_ID,
                            range="Sheet1!A1:A508").execute()
#------------------------------------------------------------
values = result.get('values')
values2 = result2.get('values')

name = []
nameraw = []
#put names from values and values2 into lists
for x in values:
        for i in x:
                name.append(i)

for x in values2:
        for i in x:
                nameraw.append(i)

#Open the Yearbook pdf file
pdf = PdfFileReader('yearbook1.pdf')

#create a 2-D array for names & associated page numbers
pagesfinal = [[]]

#This is where everything happens. For each name, add it to pagesfinal.
#Then open each page in the pdf file and search it for the current name.
#If a version of the name is found, append the page number to the student's page list.

for i in range(len(name)):
        print(nameraw[i])
        pagesfinal.append([])
        for page_num in range (pdf.numPages):    
                pageobj = pdf.getPage(page_num)
                pageinfo = pageobj.extractText()
                pageinfo = ''.join(pageinfo.split())
                name[i]= ''.join(str(name[i]).split())
                nameraw[i] = ''.join(str(nameraw[i]).split())
                if (re.search(name[i], pageinfo)) or (re.search(nameraw[i], pageinfo)) or (re.search(name[i].upper(), pageinfo)):
                        pagesfinal[i].append(page_num +1)
                        print(pagesfinal[i])

#When the above loop completes, the following line writes the data back to the Google sheet
        request = sheet.values().update(spreadsheetId=SPREADSHEET_ID, range="Sheet1!C1", valueInputOption = "USER_ENTERED", body={"values":pagesfinal}).execute()

Leave a Reply