A Crash Course in Wikipedia Vandalism

Reader Johnny Cat wrote in to ask about which Wikipedia entries have the highest incidences of false information in them. "I'm aware that almost everything there, from Applebee's to Zorro, has errors every day," he wrote, "but something in my gut tells me there are certain topics that just attract bad submitters."


Johnny Cat (and you) will probably be as surprised as I was scrolling down Wikipedia's List of Most Vandalized Pages, because there doesn't seem to be any method to the madness of wiki vandalism as far as subject matter is concerned. Among the victims of "exceptionally high vandalism" are the entries for Jack London, baseball, Halo 2, Harry Potter, piano, home improvement and buttocks. The commonality among some of the most vandalized entries seems to be that they cover recent major news events, topics that are currently, or have been, subjects of controversy, or subjects that are simply popular and often read.

Back up. What is wiki vandalism in the first place?

Wikipedia defines it as any "addition, removal, or change of content made in a deliberate attempt to compromise the integrity of Wikipedia," which can come in a variety of flavors, such as...

Blanking: Removing all or significant parts of a page's content without any reason, or replacing entire pages with nonsense.
Page creation: Creating new pages with the intent of malicious behavior, like blatant advertising pages, personal attack pages and hoaxes.
Page lengthening: Adding large amounts of bad-faith content in order to make the page's load time abnormally long or even make it impossible to load without the browser crashing.
Spam: Adding external links to non-notable or irrelevant sites or sites that have some relationship to the subject matter, but advertise or promote in the user's interest.
Silly vandalism: Adding profanity, graffiti, random characters or other nonsense to entries or creating nonsensical and non-encyclopedic pages.
Image vandalism: Uploading shock images, inappropriately placing explicit images on pages, or using images in other disruptive ways.
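
Some of the flavors above lend themselves to simple automated checks. As a rough illustration only, and not a description of any tool Wikipedia actually runs, here is a Python sketch that flags three of them by comparing the text before and after an edit. The function name `flag_edit` and every threshold in it are assumptions picked for the example.

```python
import difflib
import re
from typing import List


def flag_edit(old_text: str, new_text: str) -> List[str]:
    """Return heuristic flags for an edit; an empty list means nothing obvious."""
    flags = []

    # Blanking: a sizeable page lost more than 90% of its text (assumed cutoff).
    if len(old_text) > 200 and len(new_text) < 0.1 * len(old_text):
        flags.append("blanking")

    # Page lengthening: the page became huge and grew tenfold (assumed cutoffs).
    if len(new_text) > 100_000 and len(new_text) > 10 * len(old_text):
        flags.append("page lengthening")

    # Collect only the words this edit inserted or substituted.
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    added = [
        word
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")
        for word in new_words[j1:j2]
    ]

    # Silly vandalism: an added word repeats one character 4+ times ("POOP!!!!").
    if any(re.search(r"(.)\1{3,}", word) for word in added):
        flags.append("silly vandalism")

    return flags
```

With these assumed rules, `flag_edit("The piano is an instrument.", "The piano is an instrument. POOP!!!!")` returns `["silly vandalism"]`. Spam, hoaxes and image vandalism would need far more context than a text comparison can offer.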

Once the damage is done, how long does it take to fix?

In the interest of science, Wikipedia user Colonel Chaos vandalized featured articles, the entries that are considered the cream of the Wikipedia crop. Since Wikipedia employs software created to help find easy-to-spot vandalism (like "Your mom!" or "POOP!!!!"), the Colonel engaged in slightly more complex vandalism of three types: Complete Nonsense, where passages of completely irrelevant prose were inserted into articles; Grave Factual Inaccuracy, where material was changed or inserted in a way that made it obvious to the average reader or editor of Wikipedia that the material was untrue (e.g., that Martin Sheen discovered hydrochloric acid by mixing potatoes with salt and invented Agent Orange for the purpose of dissolving gold); and Factual Inaccuracy, where articles were changed slightly so a reader would need some knowledge of the topic in order to spot the inaccuracy (e.g., the article on Norman Borlaug was changed from "Between 1965 and 1970, wheat yields nearly doubled in Pakistan and India" to "Between 1968 and 1975, wheat yields nearly tripled in Pakistan and India").

The average response times for these changes were 11.5 hours for Complete Nonsense, 9.25 hours for Grave Factual Inaccuracy and 57.4 minutes for Factual Inaccuracy. Colonel Chaos notes that for featured articles, which rotate on Wikipedia's main page and are heavily viewed, a reversion time of 10 minutes would be more appropriate.

Here are some highlights from the study:

Article: Elapsed time between vandalism and reversion
Medal of Honor: 1 minute
Hydrochloric Acid: 14 hours 16 minutes (edited by an automated bot between Colonel Chaos' edit and the revert)
Second Crusade: 42 hours 38 minutes (According to Colonel Chaos, "This one suffered another incident of vandalism and was reverted to my version before my modifications were corrected. Honestly, how long does it take to figure out that Gregory Peck, Bill Cosby, and Harry Potter didn't lead the Second Crusade and that Paul Revere wasn't involved?")

Is there any way to stop this madness?

Well, there was the plan to simply let vandals run amok on the entry for chickens. By sacrificing this article—"Dudes already know about chickens. Ladies also already know about chickens. Does an encyclopedia really need an article about nature's tastiest birds?"—it was hoped that the rest of Wikipedia would be spared. The plan, like the bird, never really got off the ground.

Then there's WikiScanner, created by Daniel Erenrich and Virgil Griffith, which lets users trace the source of anonymous edits to Wikipedia entries. It uses the IP address of the anonymous user (which Wikipedia logs) to identify the owner of the computer network from which the edits were made. In the past, the tool has exposed insiders at Diebold Election Systems, Exxon and the CIA covertly deleting or changing information that was unflattering to their organizations. If you can't stop a vandal, you can at least pull back the curtain of anonymity.
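
To make the idea of matching an IP address to a network owner concrete, here is a minimal Python sketch. It is not WikiScanner's code: the ownership table is invented, the address ranges are blocks reserved for documentation, and the names `NETWORK_OWNERS`, `owner_of` and `attribute_edits` are assumptions made for the example. Only the standard library's `ipaddress` module is real.

```python
import ipaddress
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical ownership table. The ranges are reserved documentation
# blocks and the organization names are made up.
NETWORK_OWNERS: Dict[str, str] = {
    "192.0.2.0/24": "Example Corp",
    "198.51.100.0/24": "Example Agency",
}


def owner_of(ip: str) -> Optional[str]:
    """Return the owner of the network containing this address, if known."""
    address = ipaddress.ip_address(ip)
    for cidr, owner in NETWORK_OWNERS.items():
        if address in ipaddress.ip_network(cidr):
            return owner
    return None


def attribute_edits(edits: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (ip, article) pairs from anonymous edits by network owner."""
    by_owner: Dict[str, List[str]] = {}
    for ip, article in edits:
        owner = owner_of(ip)
        if owner is not None:
            by_owner.setdefault(owner, []).append(article)
    return by_owner
```

Feeding it a list of logged edits yields, for each known organization, the articles edited from its network. Edits from addresses outside the table are simply left unattributed.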

iStock // Ekaterina Minaeva
Man Buys Two Metric Tons of LEGO Bricks; Sorts Them Via Machine Learning
May 21, 2017

Jacques Mattheij made a small, but awesome, mistake. He went on eBay one evening and bid on a bunch of bulk LEGO brick auctions, then went to sleep. Upon waking, he discovered that he was the high bidder on many, and was now the proud owner of two tons of LEGO bricks. (This is about 4400 pounds.) He wrote, "[L]esson 1: if you win almost all bids you are bidding too high."

Mattheij had noticed that bulk, unsorted bricks sell for something like €10/kilogram, whereas sets are roughly €40/kg and rare parts go for up to €100/kg. Much of the value of the bricks is in their sorting. If he could reduce the entropy of these bins of unsorted bricks, he could make a tidy profit. While many people do this work by hand, the problem is enormous—just the kind of challenge for a computer. Mattheij writes:

There are 38000+ shapes and there are 100+ possible shades of color (you can roughly tell how old someone is by asking them what lego colors they remember from their youth).

In the following months, Mattheij built a proof-of-concept sorting system using, of course, LEGO. He broke the problem down into a series of sub-problems (including "feeding LEGO reliably from a hopper is surprisingly hard," one of those facts of nature that will stymie even the best system design). After tinkering with the prototype at length, he expanded it into a surprisingly complex setup of conveyer belts (powered by a home treadmill), various pieces of cabinetry, and "copious quantities of crazy glue."

Here's a video showing the current system running at low speed:

The key part of the system was running the bricks past a camera paired with a computer running a neural net-based image classifier. That allows the computer (when sufficiently trained on brick images) to recognize bricks and thus categorize them by color, shape, or other parameters. Remember that as bricks pass by, they can be in any orientation, can be dirty, can even be stuck to other pieces. So having a flexible software system is key to recognizing—in a fraction of a second—what a given brick is, in order to sort it out. When a match is found, a jet of compressed air pops the piece off the conveyer belt and into a waiting bin.
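
The last step described above, turning a classifier's answer into a puff of air, might look something like the following Python sketch. Everything in it is hypothetical: the bin layout, the confidence threshold and the name `choose_jet` are assumptions for illustration, and the source does not say how Mattheij's software makes this decision.

```python
from typing import Dict, Optional, Tuple

# Hypothetical bin layout: category label -> bin (and air jet) number.
BINS: Dict[str, int] = {"brick_2x4": 0, "plate_1x2": 1, "wheel": 2}


def choose_jet(prediction: Tuple[str, float], min_confidence: float = 0.8) -> Optional[int]:
    """Return the jet to fire for a (label, confidence) prediction, or None."""
    label, confidence = prediction
    if confidence < min_confidence:
        # Assumed behavior: when unsure, fire nothing and let the piece ride on.
        return None
    # None is also returned when no bin has been assigned to this label.
    return BINS.get(label)
```

A confident `("wheel", 0.95)` fires jet 2, while a shaky `("wheel", 0.5)` passes through untouched, presumably to be run through again later.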

After much experimentation, Mattheij rewrote the software (several times in fact) to accomplish a variety of basic tasks. At its core, the system takes images from a webcam and feeds them to a neural network to do the classification. Of course, the neural net needs to be "trained" by showing it lots of images, and telling it what those images represent. Mattheij's breakthrough was allowing the machine to effectively train itself, with guidance: Running pieces through allows the system to take its own photos, make a guess, and build on that guess. As long as Mattheij corrects the incorrect guesses, he ends up with a decent (and self-reinforcing) corpus of training data. As the machine continues running, it can rack up more training, allowing it to recognize a broad variety of pieces on the fly.
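
The guess-then-correct loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Mattheij's software: a nearest-centroid classifier stands in for the neural network, each piece is reduced to a short tuple of numbers instead of a photo, and the names `CentroidClassifier`, `grow_corpus` and `review` are invented for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Features = Tuple[float, ...]


class CentroidClassifier:
    """Toy stand-in for the neural net: predicts the label with the nearest mean."""

    def __init__(self) -> None:
        self.centroids: Dict[str, Features] = {}

    def fit(self, corpus: Sequence[Tuple[Features, str]]) -> None:
        grouped: Dict[str, List[Features]] = defaultdict(list)
        for features, label in corpus:
            grouped[label].append(features)
        # One centroid per label: the column-wise mean of its examples.
        self.centroids = {
            label: tuple(sum(column) / len(rows) for column in zip(*rows))
            for label, rows in grouped.items()
        }

    def predict(self, features: Features) -> str:
        def squared_distance(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(features, self.centroids[label]))

        return min(self.centroids, key=squared_distance)


def grow_corpus(
    model: CentroidClassifier,
    corpus: List[Tuple[Features, str]],
    new_pieces: Sequence[Features],
    review: Callable[[Features, str], str],
) -> int:
    """Label new pieces with the model's guess, let a reviewer fix it, retrain."""
    corrections = 0
    for features in new_pieces:
        guess = model.predict(features)
        label = review(features, guess)  # reviewer returns the guess or a fix
        if label != guess:
            corrections += 1
        corpus.append((features, label))
    model.fit(corpus)  # retrain on the enlarged corpus
    return corrections
```

The point of the design is that the human only does work when the machine is wrong. As the corpus grows, the number of corrections per batch should fall.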

Here's another video, focusing on how the pieces move on conveyer belts (running at slow speed so puny humans can follow). You can also see the air jets in action:

In an email interview, Mattheij told Mental Floss that the system currently sorts LEGO bricks into more than 50 categories. It can also be run in a color-sorting mode to bin the parts across 12 color groups. (Thus at present you'd likely do a two-pass sort on the bricks: once for shape, then a separate pass for color.) He continues to refine the system, with a focus on making its recognition abilities faster. At some point down the line, he plans to make the software portion open source. You're on your own as far as building conveyer belts, bins, and so forth.
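
The two-pass approach mentioned above amounts to binning twice with a different key each time. Here is a small Python sketch of that idea, with pieces represented as dictionaries. The field names `shape` and `color` and the function names are assumptions for the example; the real machine works on physical bricks, not records.

```python
from collections import defaultdict
from typing import Dict, List

Piece = Dict[str, str]


def sort_pass(pieces: List[Piece], key: str) -> Dict[str, List[Piece]]:
    """One run through the machine: bin every piece by a single attribute."""
    bins: Dict[str, List[Piece]] = defaultdict(list)
    for piece in pieces:
        bins[piece[key]].append(piece)
    return dict(bins)


def two_pass_sort(pieces: List[Piece]) -> Dict[str, Dict[str, List[Piece]]]:
    """First pass by shape, then a second pass by color on each shape bin."""
    by_shape = sort_pass(pieces, "shape")
    return {shape: sort_pass(group, "color") for shape, group in by_shape.items()}
```

The result is a bin of bins: each shape category holds one sub-bin per color group.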

Check out Mattheij's writeup in two parts for more information. It starts with an overview of the story, followed up with a deep dive on the software. He's also tweeting about the project (among other things). And if you look around a bit, you'll find bulk LEGO brick auctions online—it's definitely a thing!
