A Quantitative Exploration of the CRwM Stripper Genre Collision Hypothesis

We test the proposition that the number of movies in the same calendar year combining strippers and a specific subgenre cannot exceed one without the genre being fatally destroyed in popularity. We name this the CRwM Stripper Genre Collision Hypothesis after its originator, CRwM, writing at And Now The Screaming Starts. Simple computational analysis of the English language wikipedia corpus provides some, though not definitive, support.

There’s No Computational Sociology In The Champagne Room

CRwM writes:

When no systematic approach is available, then the best you can do is pick an arbitrary point and map out the trajectory of the waning trend from your specific frame of reference. So long as you admit your frame of reference is singular, you’ll allow others, each observing the same phenomenon from their distinct frames of reference, to make whatever calculations to sync up the observations. Applying this pragmatic solution to the problem, I’m defining my frame of reference thusly: A trend is creatively bankrupt when, within a single year, you get two films mashing-up the trend with strippers. — ibid

Though there may indeed be no objective measure for trend fizzlation, we can perhaps be more systematic in our investigation of any given frame of reference. Our approach is to leverage techniques from computational sociology to brute-force search a partially structured film description corpus. Though IMDB is probably the best data source readily available to the public, its USD $15,000 license fee means using it will have to wait until my long-awaited funding from the Institute of Piss Farting About comes through. In the meantime, wikipedia is actually a pretty decent source. Though the English language version does not have great coverage of, e.g., foreign films, I would hypothesise that the gaps are congruent with the focus on trendspotting in this exercise. Wikipedia is, in the words of Bruce Sterling, a kind of common sense engine. It is suited to sampling what we think we know.

Though the approach might be naive enough to be described as folk computational sociology, I prefer to think of it as punk rock.

Spins A Web, Any Size

Though there is a very active tools community around wikipedia, most of it seems to be focused on productivity scripts for editors. Things like auto classification and flagging scripts are popular, and no doubt very useful to the editorial group, if the robot history on my very occasional contributions is any guide. From a brief google literature search, the search toolset seemed to be either very simple and widely available (use google to hit a single page) or confined to sophisticated papers on vector based searches implemented on the server side. Our cinematic exploration seemed to fall between these extremes.

My first crack at getting movie data out of wikipedia was to hit the film category page for a year and script a primitive web spider to suck down all the data from that starting point. The top entry on stack overflow also happens to suggest this.

Though I did get this on the way to working, and it can be seen as movieSpider.py at the github repo mentioned later, it’s a lousy approach. Not only do you have to tool about with faked headers because wikipedia doesn’t really want you to do this, you hit the same pages over and over again while troubleshooting. You have to deal with the relatively unstructured format of HTML with embedded tags, which implies bucketfuls of heuristics to pull out anything meaningful. Plus if you get it working, you will want to expand the time range, and end up downloading a fair chunk of wikipedia anyway.

Takeaway Corpi

It turns out that wikipedia hosts backups of its entire database in convenient xml export formats. This includes partition by language and current version archives (without all the history and discussion). These data dumps are available here. At a couple of gig, compressed, even the fairly pathetic caps and bandwidth rates of say, Australian broadband, can deal with it in a day or so while you amuse yourself playing badminton. Once uncompressed, a recent version of English language wikipedia takes up around 30 Gb, or in other words, can fit on an iPhone 4.

Once available locally, running searches is quicker, particularly while debugging a script. Extracting a subset also becomes much simpler. Pros seem to rebuild the entire database, including indexes. Indexes didn’t seem much use to me here, as I was hitting the full content of a page, but maybe I’ve underestimated the power of the word indexing in a basic local database. At any rate, the structured data was sufficient for me.

A short python script of a few hundred lines lets us pull out a particular subset of wikipedia according to a regular expression run as a search on the page content. If there is a hit, we save the entire page. This is found in movie.py and available at github. Building the subset file takes about seven hours on my machine. Using the regex [Category:[0-9]??? films], we can pull in any page that mentions films of a particular year. The resultant subset is a decent film corpus weighing in at a trim 292 Mb.
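For the curious, the first pass can be sketched in a couple of dozen lines. This is a minimal sketch, not the contents of movie.py: the function names are invented for illustration, and the pattern Category:[0-9]{4} films is only one reading of the year regex above. It assumes the usual MediaWiki XML export layout, in which each page element holds a title and a text element, and it streams the file so the whole dump never has to fit in memory.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: any four-digit year category, e.g. "Category:1998 films".
FILM_CATEGORY = re.compile(r"Category:[0-9]{4} films")


def local_name(tag):
    """Strip the XML namespace prefix: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def matching_pages(source, pattern=FILM_CATEGORY):
    """Yield (title, text) for every page whose text matches pattern.

    source is a filename or a binary file object holding an XML export.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if pattern.search(text):
                yield title, text
            title, text = None, ""
            root.clear()  # drop finished pages so memory use stays flat
```

Passing a file object opened with Python's bz2.open should let the same function read the compressed dump directly, at some cost in speed. Swapping in a different pattern is all it takes to chase a different category.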

This same script can be used with minor modification for searching other spaces that attract wikipedia editors of a particularly pedantic and taxonomical breed. Their painstaking sifting of the world into categories is what makes tricks like this possible. You could, for instance, use it to build a wiki subset of military battles with a regex like [Category:Battles involving.*].

Once the subset database is built, we can run a similar expression search across it, but one aware of the structure of film pages – that they have a title, and a category indicating a year. We can therefore attempt a quantitative validation of the search CRwM did by pure pop culture brainpower:

$ python movie.py -e stripper -e zombie

The result of this is

Zombie Strippers -- 2008
Kiss the Bride (2008 film) -- 2008 # false positive
Zombies! Zombies! Zombies! -- 2007
I Am Virgin -- 2010
Big Tits Zombie -- 2010
Can't Hardly Wait -- 1998 # song by White Zombie on soundtrack
The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies -- 1964
Hide and Creep -- 2004
end 8
Hits: 8 Scanned: 60757

This is not a fully automated process – as I have annotated above, Kiss the Bride, though it possibly would have been enriched by either zombies or strippers, has neither. A review of Zombie Strippers is instead cited in its footnotes. Similarly, this brute force text search is ignorant of synonyms – any great zombie burlesque films of the 1920s are liable to be skimmed over without comment.
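The second pass, and the reason for that false positive, can be illustrated with a small sketch. It is not the code in movie.py: search_films and YEAR_CATEGORY are hypothetical names, and it assumes that repeated -e expressions must all match somewhere in the page text, that matching ignores case, and that the year comes from the page's year category.

```python
import re

# Assumed: a film page's year is read from its "Category:YYYY films" tag.
YEAR_CATEGORY = re.compile(r"Category:([0-9]{4}) films")


def search_films(pages, expressions):
    """Return (title, year) for pages whose text matches every expression.

    pages is an iterable of (title, text) pairs. Pages with no year
    category are skipped, since they cannot be placed in a calendar year.
    """
    patterns = [re.compile(e, re.IGNORECASE) for e in expressions]
    hits = []
    for title, text in pages:
        year = YEAR_CATEGORY.search(text)
        if year and all(p.search(text) for p in patterns):
            hits.append((title, int(year.group(1))))
    return hits
```

Because the match runs over the whole page text, a page that merely cites a review of Zombie Strippers in a footnote matches both expressions just as well as the film itself. Restricting the search to the plot summary would be one way to cut this down, though that is speculation rather than something tried here.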

We also find that the editorial consensus at wikipedia disagrees with CRwM on one crucial point – it asserts Zombie Strippers and Zombies! Zombies! Zombies! were not made in the same calendar year.
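Checking a listing for same-year collisions is simple enough to sketch. The helper names below are invented, and the sketch assumes that every line carrying a # annotation in the listing above was rejected by hand, which is one reading of those annotations rather than anything movie.py does.

```python
from collections import Counter

# The hit listing from the search above, copied verbatim.
LISTING = """\
Zombie Strippers -- 2008
Kiss the Bride (2008 film) -- 2008 # false positive
Zombies! Zombies! Zombies! -- 2007
I Am Virgin -- 2010
Big Tits Zombie -- 2010
Can't Hardly Wait -- 1998 # song by White Zombie on soundtrack
The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies -- 1964
Hide and Creep -- 2004
"""


def parse_hits(listing):
    """Parse 'Title -- YEAR' lines, skipping blank and annotated lines."""
    hits = []
    for line in listing.splitlines():
        if not line.strip() or "#" in line:
            continue  # assumption: a '#' note marks a hand-rejected hit
        title, _sep, year = line.rpartition(" -- ")
        hits.append((title, int(year.strip())))
    return hits


def collision_years(hits):
    """Return {year: count} for years holding more than one hit."""
    counts = Counter(year for _title, year in hits)
    return {year: n for year, n in counts.items() if n > 1}


if __name__ == "__main__":
    print(collision_years(parse_hits(LISTING)))
```

Any year this flags is only a candidate collision. Each title still needs the kind of manual review described above before it counts for or against the hypothesis.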

Applications

Though the script should be a productivity boost to film subgenre scholars, it still requires a great deal of human insight to make its results valuable. It works best with very concrete and widely recognized subgenre identifiers. Any more complex critical viewpoint is obscured by the lack of a shared jargon. For instance, CRwM’s example of From Dusk till Dawn as being a pomo deadpan crime flick is hardly controversial, but insufficiently universal to appear in wikipedia entries across the subgenre.

Some other notable datapoints:

  • There are no stripper werewolf films. The blue movie scene near the end of An American Werewolf in London is insufficiently focused to qualify.
  • This technique confirms no vampire stripper collision counts exceeding one in a calendar year.
  • Two films in 1964 were both musicals and featured strippers: Robin and the 7 Hoods and the aforementioned (and new to the author) The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies. Since The Sound of Music appeared in 1965, and the genre survived at least until the popularity of Cabaret, we posit that either Robin contains insufficient stripper content to qualify as a stripper musical, or that musicals, as a full-blown genre, are outside the scope of the CRwM Stripper Genre Collision Hypothesis.

We believe our results should be repeatable, keeping in mind the central role of a critical human eye in this endeavour. For the convenience of those cinephiles who are interested in the output, but not technically inclined, a sorted listing of every stripper film in wikipedia is provided. This paragraph seems a fair bit creepier now than when we first thought of it.

Conclusion and future work

Searching for “stripper zombie” on wikipedia yields 108 results of varying quality. Using the techniques above, this can be narrowed to six films. A film subset database built from an expression seems like something someone else could use. Say, to pitch a zombie werewolf stripper musical. Perhaps one that’s incredibly mixed up.

The two-stripper-flicks-a-year thing isn’t meant as a value judgment. It’s simply a law of the universe. — TNCITCM, you know, the article this whole post is about.

WP:Vote

John B points out (off-blog) a post on The New Republic that with its blend of political and technical metaphors sounds more like a post from early 21st century South Sea Republic: Wiki-constitutionalism.

It describes the tremendous affection South American nations have for rewriting their constitutions from scratch, at a rate of once every ten years or so.

Though it’s a catchy name, Wiki-constitutionalism isn’t a great analogy. The defining aspect of C2 or wikipedia was always progressive collaborative refinement of its documents. A rewrite from scratch is more akin to what Jefferson advocated, in Cam’s words:

Jefferson believed constitution’s should be sunsetted every 25 years, so each succeeding generation can rewrite government to be a reflection of themselves. I agree. The reason republicanism has such traction is that our constitution is a 16thC document with an elected upper house thrown in. Many of the errors, skewings and inefficiencies in our system can be traced to their constitutional origins.

To continue the analogy, and reuse one that came up on SSR more than once, it is like throwing out a creaking legacy system written in VB by a million monkeys, and having a new crack team come in and rewrite it in Python (or the tech du jour).

The example of South America is, however, not reassuring. Going back to TNR:

Latin American leaders have discovered that, by packaging ever-longer lists of promises and rights alongside greater executive functions, they can make a new constitution appealing enough to the masses that they will vote for it in a referendum. The result is constitutions that are not only the shortest-lived, but also among the longest in the world. Bolivia’s and Ecuador’s recently approved constitutions have 411 and 444 articles, respectively, and read like laundry lists of guaranteed rights, such as access to mail and telephones; guarantees for culture, identity, and dignity; and shorter work-weeks. By contrast, the U.S. Constitution, the longest-serving in the world, has only seven articles and 27 amendments.

Making most of these efforts, to complete the last lap around this allegorical track, about as successful as your typical Big Redesign In The Sky.