Part of the Machina blogchain.
If you have not already, I recommend reading the Machina Blogchain index for additional context on this post and the rest of the series. Note that this first post documents my work through a (short) project and focuses on the methodology and objectives more than on the tools used to complete the work (Python 3, Google Colaboratory and GPT-2). I plan to revisit these tools in greater depth as I progress through the blogchain, but this post's priority is demonstrating how I was able to use an open-source language model to create unique and interesting text outputs.
I personally find that the most efficient way to learn a topic is to emphasise application over the theory of the tools and methodologies. Approaching problems this way, we can troubleshoot issues as we get stuck, learning just enough of each tool to remedy the problem and move to the next step. Learning like this makes much more sense to me when working through fields that are new to me. The alternative is that I end up reading sub-branches deemed 'foundational' that require large time investments without much sense of how they interconnect.
My first project is one that has been a must-do since I discovered the AI bot 'wint but AI', a hilarious parody of the well-known Twitter user @dril. This account is a human-curated feed of tweets generated by GPT-2, a text-generating model released by OpenAI.
Essentially, I want to run the same project (without the public feed) and generate text in the spirit of one of my favourite bloggers, Venkatesh Rao. Last I saw, Venkatesh had around 97,000 tweets (!), which makes his account perfect for this type of project: more tweets means more data for the model to learn from and thus better text generation. More on this soon.
For this project, I am following a guide from Max Woolf, a data scientist with excellent knowledge of GPT-2 and a bunch of handy resources, including a Google Colaboratory notebook which you can copy to your own Google Drive and use for your project.
The first step is to 'scrape' the tweets from the chosen Twitter account. Here, I had to install Python 3 and the Build Tools for Visual Studio 2019 to make the installation of twint work (the Twitter-scraping library the download script depends on, outlined in the first step of Max's guide).
With these prerequisites installed, I downloaded and ran the Python script with py download_tweets.py vgr, targeting Venkatesh's Twitter handle (I used the py command instead of python3 based on a recommendation I found on Google after hitting some errors). This process takes a long time, so be sure you have enough hours left in the day when you begin, or leave it running overnight; in my case, all 97,000 tweets were compiled into a .csv file by morning. Note: my .csv had a blank row between each tweet, which caused an error later in the process. Broken data like this is a common problem in this type of work, and cleaning inputs is an important part of a successful project. I simply copied my error into Google and learned that the file wasn't being read properly because of the blanks. To fix this, I opened the file in Microsoft Excel, used the 'Find and Replace' tool to select all blank cells, and deleted those rows.
Training the model
With this data, the next task is to 'train' the AI model on all of this raw text. Before this, I configured the Google Colaboratory notebook, downloaded GPT-2 and mounted my Google Drive to the notebook filesystem; these steps are straightforward thanks to Max Woolf's pre-configured notebook. With the notebook set up, I set the .csv file as the dataset file name and ran the fine-tuning process, in which the language model learns the vocabulary and style of the Twitter user. This also takes quite a while, though for me it was much faster than the hours spent scraping the Twitter feed.
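For reference, Max's notebook is built on his gpt-2-simple library, and the fine-tuning step boils down to a few calls. A rough sketch, wrapped in a function here for illustration (the step count and run name are my assumptions, not the notebook's exact settings):

```python
def finetune_on_tweets(csv_path, run_name="vgr_run", steps=1000):
    """Fine-tune the smallest GPT-2 model on a tweet dataset via gpt-2-simple."""
    import gpt_2_simple as gpt2  # pip install gpt-2-simple

    gpt2.download_gpt2(model_name="124M")   # fetch the 124M-parameter base model
    sess = gpt2.start_tf_sess()             # TensorFlow session used for training
    gpt2.finetune(sess,
                  dataset=csv_path,         # the cleaned .csv of scraped tweets
                  model_name="124M",
                  steps=steps,              # more steps = longer fine-tuning
                  run_name=run_name)        # checkpoints are saved under this name
    return sess
```

This is meant to show the shape of the workflow rather than replace the notebook, which handles the Colab GPU setup and Google Drive mounting for you.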
Finally, I generated and exported a few thousand tweets into .txt files to look through. I will not share them at this time: I am unsure where the publication of such AI-generated text currently stands, and I would not release anything without Venkatesh's express consent, as the project uses his (public) data. Nevertheless, I hope to dive deeper into all aspects of this project, and into related projects, over the coming months.
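The generation step works similarly: gpt-2-simple can write samples from a fine-tuned model straight to a text file. A sketch of that step (the sampling parameters here are my own guesses, chosen for tweet-length output, not values from the guide):

```python
def generate_tweets(sess, run_name="vgr_run", out_path="generated_tweets.txt"):
    """Write a batch of samples from a fine-tuned GPT-2 model to a .txt file."""
    import gpt_2_simple as gpt2

    gpt2.generate_to_file(sess,
                          run_name=run_name,
                          destination_path=out_path,
                          length=60,         # short outputs, roughly tweet-sized
                          temperature=0.7,   # lower = safer, higher = stranger
                          nsamples=100,      # how many samples to write
                          batch_size=20)     # nsamples must divide evenly by this
```

Raising the temperature is the easy lever for making the output more surprising, at the cost of coherence.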
The excellent guide from Max, including the pre-configured Google Colaboratory notebook, made this a relatively straightforward and enjoyable project to start my machine learning experience. I was extremely impressed by the quality of the output and have been reading a little deeper into the nature of OpenAI's language models; along the way I stumbled upon an educational resource the OpenAI team created called 'Spinning Up'. This looks like a great way into deep reinforcement learning resources and applications, and will likely be the focus of many of the upcoming blogchain posts. I am also looking to drill down further into how Python works and what interesting projects I could build while learning more of the language.
On the whole, I am very pleased with my first venture into machine learning and I’m sure the next post in this series won’t be too far away.