Wiki Game Musings

I wrote a post earlier about The Wiki Game, which is one of my new favorite activities. If you’re interested in playing, take a look at that post in the Puzzles section. Essentially the game is just to navigate from one random article on Wikipedia to another randomly chosen article by clicking on links in the articles. The trick is to do it in as few clicks as possible.

I knew that playing the game was a bit of a Pandora’s box, but now I find myself lying in bed in the wee hours of the morning, staring into the bottomless abyss of intellectual possibility that it has opened up in my brain.

Specifically, I’ve been lying here thinking about how to create an automated system for finding the shortest route from one page to another. Coming up with a real solution would take quite a bit more research and some intensive tinkering but I think I’ve made some real progress in dissecting how to approach the problem.

After a long day of work, thinking about things that actually have real world impact and utility, I often go on these pointless mental tangents as a way to decompress. My wife, who is normally my partner in crime for these kinds of explorations, is asleep so instead of just lying here thinking these thoughts to myself, I’ve decided to write them down on the off chance that anyone else is interested in following along.

Before we get started, I should warn you that this is likely to get disorganized and deep in the weeds and I’m going to put entirely too much thought into ideas that probably don’t deserve it and there isn’t likely to be a satisfying resolution at the end of it and the time you spend reading the rest of this post is time that you will never get back.

Now that the disclaimer is out of the way, let’s dive in.

Fundamentally, the problem can be reduced to a simple directed graph (digraph) where each vertex represents an article and the directed edges connecting the vertices represent the hyperlinks from one article to another. (Basically think of each article as a dot on a flat plane, and each hyperlink between articles as an arrow pointing from one dot to the next.)
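
To make that concrete, here’s a minimal sketch (in Python, which is what I’d probably reach for) of storing that digraph as a plain adjacency list. The article titles are real enough, but the links are a toy stand-in for what we’d eventually pull out of Wikipedia itself:

```python
# A toy adjacency list: each key is an article (a vertex), and each value is
# the list of articles it links to (the directed edges). The links here are
# made up for illustration; the real data would come from Wikipedia.
wiki_graph = {
    "Kevin Bacon": ["Philadelphia", "Footloose (1984 film)"],
    "Footloose (1984 film)": ["Kevin Bacon", "Kenny Loggins"],
    "Kenny Loggins": ["Footloose (1984 film)"],
    "Philadelphia": ["United States", "Liberty Bell"],
    "United States": ["Philadelphia"],
    "Liberty Bell": ["Philadelphia"],
}
```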

If the total number of articles or the average number of hyperlinks in each article were small enough then you could use a simple tree search algorithm to exhaustively search through all of the possible routes between any two given articles.

Given that the English version of Wikipedia has more than six million articles and that most articles seem to have dozens or hundreds of hyperlinks, this kind of brute force approach could quickly become computationally intractable.

If I can find some information about the average number of links per article then I could do some back of the napkin math to see how quickly that decision tree branches.

I have a pretty powerful GPU with purpose-built tensor cores that could be dedicated to this project for a while but my intuition is that the number of calculations necessary to exhaustively search would be well beyond my hardware’s ability to churn through in any reasonable amount of time.

Actually, let’s make some assumptions and take a look at the math. If each article only linked to an average of 20 other articles and we limit our search to 8 hops (on the assumption that any two pages on Wikipedia can be linked in 8 hops or less) then we’re looking at an average of 25.6 billion connections to analyze between any two endpoints.

My GPU can handle on the order of 10 trillion floating point operations per second so even if you assume quite a bit of inefficiency and overhead, the calculations could probably be done in a perfectly reasonable amount of time.

If, however, the average number of links is more like 80 per article then an 8 layer deep brute force search would be looking at roughly 1.7 quadrillion connections. Let’s assume it’s the lower number for now.
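
Just to keep myself honest, here’s that napkin math in code form:

```python
# Rough size of the brute-force search space under the assumptions above.
max_hops = 8

print(20 ** max_hops)   # 25,600,000,000          (~25.6 billion paths)
print(80 ** max_hops)   # 1,677,721,600,000,000   (~1.7 quadrillion paths)
```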

This straight-up tree search approach is simple, but it’s quite inefficient and inelegant, so let’s look at some ways we can improve it. Before we do, though, there’s another big problem we need to address: if we want to analyze all of these connections, we need to store all of them in an optimized database.

If we were relying on actually loading each page as we search, then it wouldn’t matter how fast the GPU is; the bottleneck would be page load times. Even if we front-loaded the project by scraping all of Wikipedia one page at a time and cataloging the articles and links into a matrix of vertices and directed edges, loading and analyzing each page would be a long process.

If we could average one page per second it would take about 70 days. Given that Wikipedia changes all the time, this much prep time seems problematic. Fortunately, Wikipedia allows you to download a full copy of the entire encyclopedia in one neat little package which means that we can do away with the whole page load problem and just worry about analyzing the articles and links and converting them into a matrix representing the corresponding digraph.
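
In case you’re curious what that conversion might look like, here’s a very rough sketch that streams through one of the XML dump files and pulls the [[wiki links]] out of each article’s text. The file name and the regular expression are placeholders, and I’m glossing over redirects, templates, and namespaces entirely:

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Wiki links look like [[Target article]] or [[Target article|display text]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

def build_graph(dump_path):
    """Stream a Wikipedia XML dump and build an adjacency list of links."""
    graph = defaultdict(list)
    title = None
    for _, elem in ET.iterparse(dump_path):
        tag = elem.tag.rsplit("}", 1)[-1]       # strip the XML namespace
        if tag == "title":
            title = elem.text
        elif tag == "text" and title and elem.text:
            graph[title] = [link.strip() for link in LINK_RE.findall(elem.text)]
        elem.clear()                            # keep memory from ballooning
    return graph

# graph = build_graph("enwiki-latest-pages-articles.xml")  # placeholder file name
```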

I have no idea how long that might take but I suspect that my computer could handle it in a reasonable amount of time so let’s see if we can’t make some improvements to our very unrefined algorithm.

Once we have our digraph representation of Wikipedia, the path from one article to another just becomes a route optimization problem. The good news is that there are plenty of really good path search algorithms out there. The bad news is that the ones I’m familiar with require a weighted graph in order to figure out how to effectively “prune” the decision tree.

At this point, our graph is directed but not weighted, which precludes the use of the path search algorithms I know about, like the A* search algorithm, unless we can come up with some heuristic to weight the links between articles.

I started thinking about assigning weights based on the number of links that articles have in common with other articles, but that starts us down a whole new rabbit hole that I’m not prepared to explore at the moment. Besides, weighting all those connections would add a significant amount of processing time to the initial setup of the database, and figuring out how to model that added time sounds complicated, so let’s just drop this line of thought for the time being.

There are probably other path search algorithms out there that don’t rely on assigning weights to the edges but I don’t know anything about them so, again, this seems like a rabbit hole for another time.

One thing that we can do to simplify the decision tree is to make sure that we don’t end up going back and forth between articles. If we add a function to our algorithm that prevents it from going back to any vertex that it has already visited, then we can pretty significantly reduce the rate at which the decision tree expands and even put some upper bounds on the number of calculations that will need to be performed.
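
As it happens, that “never revisit” rule is the heart of a breadth-first search, and because every click costs the same, exploring the graph one hop at a time means the first route to reach the target is also one with the fewest clicks. Here’s a minimal sketch of that idea against the toy graph from earlier:

```python
from collections import deque

def shortest_route(graph, start, goal):
    """Search outward from start one hop at a time, never revisiting an article."""
    visited = {start}
    queue = deque([[start]])                # a queue of partial routes
    while queue:
        route = queue.popleft()
        article = route[-1]
        if article == goal:
            return route                    # first hit = fewest clicks
        for neighbor in graph.get(article, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(route + [neighbor])
    return None                             # no route exists

# With the toy graph from earlier:
# shortest_route(wiki_graph, "Kenny Loggins", "Liberty Bell")
# -> ['Kenny Loggins', 'Footloose (1984 film)', 'Kevin Bacon', 'Philadelphia', 'Liberty Bell']
```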

It’s way past my bedtime, so I’ll save the rest of the heavy lifting for another time and just leave you with a couple of final thoughts.

I suspect that there are a relatively small number of articles on Wikipedia that have a TON of incoming and outgoing hyperlinks. If you could identify these “supernodes” and map the connections in their vicinity, then you could create a hub-and-spoke graph traversal system. Basically, instead of trying to haphazardly get from one random article to another, you could attempt to pass through one of these supernodes on the assumption that it’s relatively easy to get from any article to a supernode and relatively easy to get from a supernode to any other article.

This is basically the system that services like FedEx and UPS use to deliver packages efficiently. The problem with this approach is that while on average it’s likely to give a relatively efficient way to move around between articles, in any specific instance, it’s unlikely to give you the optimal path from one article to another.
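
Just to sketch the first step of that idea: once the adjacency list exists, ranking candidate supernodes is a matter of counting each article’s incoming and outgoing links. Something like this (again using the toy graph from earlier):

```python
from collections import Counter

def find_supernodes(graph, top_n=10):
    """Rank articles by total degree: outgoing links plus incoming links."""
    degree = Counter()
    for article, links in graph.items():
        degree[article] += len(links)       # outgoing
        for target in links:
            degree[target] += 1             # incoming
    return degree.most_common(top_n)

# find_supernodes(wiki_graph, top_n=3)
```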

The final undeveloped thought that I’ve been playing around with is the idea of creating a system to sit on top of a GPT-3 API. If you’re not familiar with GPT-3, it’s a deep learning-based language model developed by OpenAI. It’s one of the most fascinating things to come out of the field of machine learning and you should definitely look it up and learn more about it. I promise it will blow your mind. It feels like magic and it does things that we used to think only humans were capable of.

If you want to learn about GPT-3 but you’re too lazy to look it up, give me a call. I love talking about stuff like this and I’d be happy to give you a primer on GPT-3. Maybe I’ll write a post all about GPT-3 sometime.

That’s it. We didn’t actually solve the problem of creating an efficient algorithm to play the Wiki game but I think we made some good progress towards an eventual solution, especially given the fact that all this has been done in bed under the covers next to a sleeping wife and a snuggling cat.

If you’ve made it this far, thanks for joining me on this most pointless of journeys. If you have any ideas you want to contribute or if you have any questions about my long, rambling, and incomplete train of thought above, drop a line in the comments and we can chat about it tomorrow after I get some rest.
