The Tedium of the Long-Distance Analyzer
Getting LLMs to do Some of Your Work
“Data scientists spend about 45% of their time on data preparation tasks, including loading and cleaning data,” according to a survey of data scientists conducted by Anaconda.
Large Language Models like ChatGPT, Bard, Bing and others are often touted as ushering in a new age of productivity for the knowledge worker. If you haven’t dived in very deeply, you might be forgiven for thinking that this AI stuff is just a “party trick” or another fad that will blow over.
I think not. There are just too many good use cases where you can demonstrate to yourself that these crazy computer programs are helpful in the real world.
For example, a large part of modern work is analyzing data. And a large part of analyzing data involves tedious tasks like sourcing, extracting, cleaning and loading the data. These tasks are vital and necessary, but none of them is “value-add” in the sense of the final output. And they eat up a lot of time.
Maybe our old buddy ChatGPT (or your own favorite Large Language Model) can help us get some of that time back.
In order to get LLMs to do more of our work for us, we need to provide them with more tools. Most of our analytical assignments rely on some amount of up-to-date information. LLMs are typically trained on data that is months or years old, and of course, not every single possible item is in their training set – so a lot of factual data just isn’t there to pull out.
Thankfully, the newest version of ChatGPT has been enhanced with plug-ins and extra functionality to overcome this limitation. We’ll look at two of these that can really enhance our productivity: Web Scraper and Code Interpreter.
The Web Scraper plug-in allows us to guide ChatGPT to access the internet and pull current information from websites – we can even specify the site and what we want to extract.
One interesting thing about LLMs is that, generally, they are bad at math. Ironic, isn’t it? Some plug-ins have been developed to overcome this limitation, and recently ChatGPT has even added the ability to execute Python code right in the chat dialog. (This is available only in the paid ChatGPT Plus version, with GPT-4.)
Recently I ran across a discussion on Reddit about cancer rates per 100,000 population, by county, in California. The data looked interesting: a quick scan of the website showed a huge variation between counties, with a nearly 3x difference between the lowest and highest rates. The site also included cancer rates adjusted for age. Since we know that cancer is mostly a disease of aging, this makes sense, because a county with an older population will show a higher raw rate even if nothing else is different. But from the website it was hard to get the big picture, since it was set up like a “slide show.” Irritating for data nerds!
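If "adjusting for age" sounds mysterious, here is a minimal sketch of one common way to do it, called direct standardization. I'm assuming this technique purely for illustration; I don't know exactly how the website computed its adjusted figures. Every number below is made up, and the function names, the three age groups and the weights are hypothetical, not taken from the California data.

```python
# Hypothetical illustration only: none of these numbers come from the California data.

def crude_rate(cases, population):
    """Cases per 100,000 people, ignoring the age mix of the population."""
    return sum(cases) / sum(population) * 100_000


def age_adjusted_rate(cases, population, standard_weights):
    """Direct standardization: weight each age group's rate by that
    group's share of a fixed 'standard' population."""
    if abs(sum(standard_weights) - 1.0) > 1e-9:
        raise ValueError("standard weights must sum to 1")
    return sum(
        w * (c / p) * 100_000
        for c, p, w in zip(cases, population, standard_weights)
    )


# Age groups: under 40, 40-64, 65 and over (made-up standard population shares)
STANDARD_WEIGHTS = [0.55, 0.30, 0.15]

# Two made-up counties with identical rates within each age group,
# but very different age mixes.
young_county = {"cases": [50, 300, 400], "population": [100_000, 60_000, 20_000]}
old_county = {"cases": [20, 300, 1600], "population": [40_000, 60_000, 80_000]}

for name, county in [("young", young_county), ("old", old_county)]:
    crude = crude_rate(county["cases"], county["population"])
    adjusted = age_adjusted_rate(
        county["cases"], county["population"], STANDARD_WEIGHTS
    )
    print(f"{name} county: crude {crude:.1f}, age-adjusted {adjusted:.1f}")
```

In this toy example the older county's crude rate comes out more than twice the younger county's (roughly 1,067 versus 417 per 100,000), yet both counties get the same age-adjusted rate of 477.5, because within each age group their rates are identical. That is the kind of effect to keep in mind when reading a big gap in raw county rates.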