Parsing PDF-extracted text through Excel & VBA
Situation:
Extracting data as text from a PDF is never as simple as it first seems. For this project, the data in the PDF, after being extracted, had no natural delimiters except for spaces. However, some of the columns contained proper nouns for cities, which would mess up Excel when going to do a data import from the extracted text.
Starting with a script I found to clear out non-numeric values in a single column with VBA, I came up with this:
After importing space-delimited text, you will naturally be left with an excel document that has unevenly distributed data. If the data you want to get to is purely numeric, then you can run this script.
It jumps through each row, shifting all numeric data left so that only separate numeric values appear on each column. If only numeric data was desired, this should do the trick.
The data set I worked on was California Department of Public Health Service Planning Area by Zip Code data. I’m thinking of making a form for anyone to type in their zip code and get back their SPA, but i’m not sure if it would be in the public interest. If I get any comments that it’s desired, i’ll be happy to put it up.