Monday, July 6, 2015

Graph databases

UPDATE: As per frickin' usual, the tinkerpop documentation has completely changed, so the Titan parts of this guide are no longer accurate (not even close). Fortunately, their tutorials are a little more complete this time, so I recommend starting there if you want to use Titan. I'll write a new post soon if I get a chance...

Audience

This post is intended for people who know what a graph is, but are completely new to graph databases. If, like me, you've tried to understand all the lingo surrounding them, or actually tried to get started using one, you have probably gotten very frustrated. It helps to realize that this space is still very bleeding-edge, as in:

"On the cutting edge, you cut. On the bleeding edge, you bleed."

These are my notes that I've collected over the last few weeks – they're just my impressions from the documentation that I've encountered. If you feel that I've misrepresented your favorite tool / framework / whatever, please correct me, or, better yet, fix your tools / documentation so that people don't come off with these impressions.

Disclaimer: if you're on Windows, commands may be a bit different.

Getting started quickly

Neo4j and Titan are probably your best options when it comes to easy installation. They are database systems (like Postgres or MongoDB), except they are dedicated to graphs (Titan is technically more of a layer on top of HBase, Cassandra, or BerkeleyDB, but for getting started you might as well think of it as its own thing).

Both have a package that you can just download, decompress, run a script in the bin/ directory, and you'll technically be in business.

Neo4j

Neo4j is the easiest:

Download the community edition here, and decompress it anywhere you like (you'll be running the server from the folder, in case that helps your decision). Then on the command line, type:

./neo4j-community-2.2.3/bin/neo4j start

If you get warnings / errors about your JVM (e.g. you're using OS X), you'll want to install Java 7 (I don't think 8 is supported yet), and add this to ~/.bash_profile:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home

Next, open http://localhost:7474 (Neo4j's default port) in your browser.

At this point you probably don't need this guide (you probably don't need it to begin with), but if you're like me, you want at least a rough idea of whether learning to use Neo4j is worth your time. Here are my impressions:

Cypher (Neo4j's special query language) is really easy to learn and use, but it's not quite as powerful as Gremlin (more on that later)... and if you start building a tool that uses Cypher, you'll be locked in with Neo4j. This is a problem, because the community (free) edition of Neo4j doesn't scale beyond one machine. On the flip side, if your project uses Gremlin, it should work with a variety of database backends, including (allegedly) Neo4j. As of this writing, it doesn't actually work (something to do with Gremlin not working with Neo4j version 2...) but it should eventually.

That said, learning Cypher was worth it, even though it's not going to end up as part of the tool I'm developing. Normally, I hate it when people say things like this (I've got a deadline, man! I don't have time for this!), but I think in this case, just getting a feel for graph databases in a non-threatening environment really may save you time in the long run. Every other graph database environment that I've encountered has been pretty threatening – outside of Neo4j, documentation is usually very poor, and installations and configurations are very elaborate.
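To give a feel for what talking to Neo4j from a program looks like, here is a minimal sketch that sends a Cypher statement to the transactional HTTP endpoint that ships with Neo4j 2.x. The helper names (`build_cypher_payload`, `run_cypher`) are my own, the URL assumes the default port 7474, and the `Person` / `KNOWS` data in the usage note is made up. Neo4j 2.2 may also ask for credentials, which this sketch leaves out.

```python
import json
import urllib.request

# Assumption: a local Neo4j 2.x server on its default HTTP port.
NEO4J_URL = "http://localhost:7474/db/data/transaction/commit"


def build_cypher_payload(statement, parameters=None):
    """Wrap one Cypher statement in the JSON body the endpoint expects."""
    return {
        "statements": [
            {"statement": statement, "parameters": parameters or {}}
        ]
    }


def run_cypher(statement, parameters=None, url=NEO4J_URL):
    """POST a single statement and return the decoded JSON response.

    Needs a running server; no authentication header is sent.
    """
    body = json.dumps(build_cypher_payload(statement, parameters)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, something like `run_cypher("MATCH (a:Person)-[:KNOWS]->(b) WHERE a.name = {name} RETURN b.name", {"name": "Alice"})` should return the matching rows as JSON (the `{name}` placeholder is the Neo4j 2.x parameter syntax).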

Titan

Titan's pre-packaged bundles are probably the easiest way to get started with something scalable that works with the Gremlin query language out of the box:

https://github.com/thinkaurelius/titan/wiki/Downloads
If you don't know the difference, just go with the latest Hadoop 2 bundle.

Once you decide to work outside the cozy world of Neo4j and Cypher, you're suddenly exposed to all kinds of jargon; sometimes it's not even easy to figure out what you're downloading. And, unfortunately, there are lots of out-of-date guides out there, so as you google around, be aware that what you are reading may not even apply any more. Welcome to the bleeding edge.

To get started with Titan, run

./titan-0.5.4-hadoop2/bin/titan.sh start

and go to http://localhost:8182/doghouse

There's a nice web interface, where you can write Gremlin, search and visualize bits of the graph. As a visualization researcher, I was actually somewhat impressed – it seems to do one simple thing reasonably well (traversing the graph one node at a time), and doesn't commit any of the typical heinous visualization sins.

Titan's documentation has a good walkthrough that will give you a taste for Gremlin.

Jargon!

Technically, the interface you see at localhost:8182/doghouse is part of the Rexster project that Titan has nicely included for you. Rexster, in turn, is part of the Tinkerpop stack. The Tinkerpop stack consists of several projects, including Blueprints, Rexster, Furnace, Frames, Gremlin, and Pipes. What's really confusing about each of these is that each project defines itself in terms of the others – it's really hard to parse what the *#&! each does, and why I should care.

With Titan's nice bundle, we can ignore most of this crap, and only worry about Rexster and Gremlin. Rexster is a "graph server" – meaning something that exposes one or more graph databases to make it easy for programs to use the database without caring what the database actually is. In other words, as long as I've written my queries with the Gremlin language, I shouldn't have to care whether the database is Titan or Neo4j (I think Blueprints is technically the project that encapsulates this abstract idea, and Rexster is the software layer that makes it possible).

Initially, I tried setting up Rexster to talk to Neo4j, but aside from the configuration file hell, there are compatibility issues between the two – for learning purposes and initial development with Gremlin, you're probably best off if you just use Titan's bundle.
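As a rough illustration of what "exposing a graph to programs" means, here is a sketch that sends a Gremlin script to Rexster over plain HTTP, using the Gremlin extension (`tp/gremlin`) that Rexster 2.x provides. The graph name `graph` is an assumption about how the Titan bundle configures Rexster, and the helper names are mine; check the server's configuration file before trusting either.

```python
import json
import urllib.parse
import urllib.request

REXSTER_BASE = "http://localhost:8182"
GRAPH_NAME = "graph"  # assumption: the name the Titan bundle gives its graph


def gremlin_url(script, base=REXSTER_BASE, graph=GRAPH_NAME):
    """Build the URL for Rexster's Gremlin extension with the script URL-encoded."""
    query = urllib.parse.urlencode({"script": script})
    return "%s/graphs/%s/tp/gremlin?%s" % (base, graph, query)


def run_gremlin(script, base=REXSTER_BASE, graph=GRAPH_NAME):
    """Send the script to a running Rexster server and decode the JSON reply."""
    request = urllib.request.Request(
        gremlin_url(script, base, graph),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

The point is that nothing in the calling code knows or cares which database sits behind Rexster; only the Gremlin script and the graph name matter.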

Coming Soon: Using the database

The web interfaces are nice, but how do I get my program to communicate? There are great libraries for connecting directly to Neo4j (py2neo, etc), but there isn't really much that can connect to Rexster. I'm in the middle of this mess – hopefully I'll do another post soon with clearer guidance on how to load data and query from Node.js. For now, in case it's useful, here are some of my notes:

Mistake #1: the bulbs python library. It's really buggy and I don't think it's maintained anymore.

Mistake #2: tried to load packets of data with grex (a Node.js library), but ran into errors with no way to debug. I think I'll be able to use grex to query, but loading data is a different story. Interestingly, Titan lists bulk-loading as an "Advanced Topic"... yet it looks like the only place in the documentation that actually talks about how to get data in.

Current approach (probably a mistake): I'm reshaping my data into "GraphSON format" (not a real thing yet, afaik), that I'll just load with the gremlin console in the web interface. Weird nuances / notes so far:

  • g.loadGraphJSON() doesn't exist anymore – it was replaced by g.loadGraphSON()
  • I've discovered two approaches to GraphSON: while the Titan bundle ships with example files that are lists of nodes that respectively contain lists of edges, g.loadGraphSON() expects separate node and edge lists...
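For the second shape (separate node and edge lists), the reshaping can be sketched roughly like this. The reserved keys (`_id`, `_type`, `_outV`, `_inV`, `_label`) and the `"mode": "NORMAL"` wrapper follow my reading of the TinkerPop 2 GraphSON format, so treat them as assumptions to verify against what g.loadGraphSON() actually accepts. The function name and the input layout are made up for illustration.

```python
import json


def to_graphson(nodes, edges):
    """Reshape plain Python data into separate vertex and edge lists.

    nodes: dict mapping a node id to a dict of properties
    edges: list of (source_id, target_id, label) tuples
    """
    vertices = []
    for node_id, properties in nodes.items():
        vertex = dict(properties)      # copy so the input is left untouched
        vertex["_id"] = node_id
        vertex["_type"] = "vertex"
        vertices.append(vertex)

    edge_list = []
    for index, (source, target, label) in enumerate(edges):
        edge_list.append({
            "_id": index,              # hypothetical id scheme: list position
            "_type": "edge",
            "_outV": source,           # the vertex the edge leaves
            "_inV": target,            # the vertex the edge enters
            "_label": label,
        })

    return {"graph": {"mode": "NORMAL", "vertices": vertices, "edges": edge_list}}


def to_graphson_text(nodes, edges):
    """Serialize the structure so it can be saved to a file and loaded."""
    return json.dumps(to_graphson(nodes, edges), indent=2)
```

For example, `to_graphson({1: {"name": "a"}, 2: {"name": "b"}}, [(1, 2, "knows")])` produces two vertices and one edge pointing from vertex 1 to vertex 2.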

Thursday, November 14, 2013

MacFusion, OSXFUSE, and Mavericks

Of course, with another version of OS X, we have more problems with Macfusion.

The first problem you'll encounter with a fresh, normal installation is the "Authentication has failed" error. With the help of this thread, it turns out that XQuartz needs to be installed as well (I know, that's intuitive...).

Here's the current workaround:

After this, I got the good old "Mount process has terminated unexpectedly" error when I tried to mount my home directory on the server as an SSHFS volume. After inspecting logs in the Console, it looks like it doesn't like an empty or relative Path (I had tried leaving it blank or setting it to ~/). It works with absolute paths, though.

Wednesday, November 13, 2013

Two-way data binding in brython: knockout.js

I spent the last few days trying to learn AngularJS, and was resigned to the thought that I was going to have to suck it up and write my current project in straight javascript. I was getting frustrated with all the baggage that comes with AngularJS and its awful documentation (I still don't have a proper understanding of directives!) when I stumbled on KnockoutJS. It focuses on doing one thing very well, and doesn't force a particular flavor of MVC on you. Its documentation is awesome and has an approach to tutorials that, in my humble opinion, is more natural than the one at Khan Academy.

And there's an interesting pattern in how it wraps observables.

To get around problems with IE, you access and modify KnockoutJS variables via functions. This has a handy side effect in that everything it touches is sanitized. In other words, you can have observable brython objects that are modified and returned directly without any javascript baggage! You don't even have to wrap the ko library in a JSObject... just use it directly in your python script. The only thing you have to remember is which objects are observable and which ones aren't - but you'd have to do that in javascript, too.
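If the "variables are functions" idea sounds odd, here is a toy version of the pattern in plain Python. It is only a sketch of the idea, not KnockoutJS's implementation: the `Observable` class and its `subscribe` method are stand-ins I wrote for illustration.

```python
class Observable:
    """A value you read by calling with no arguments and write by calling with one."""

    def __init__(self, value=None):
        self._value = value
        self._subscribers = []

    def __call__(self, *args):
        if not args:
            return self._value            # read access: x()
        new_value = args[0]
        if new_value != self._value:      # only notify on an actual change
            self._value = new_value
            for callback in list(self._subscribers):
                callback(new_value)

    def subscribe(self, callback):
        """Register a function to be called with each new value."""
        self._subscribers.append(callback)
        return callback


x = Observable(100)
seen = []
x.subscribe(seen.append)   # anything bound to x would register itself like this
x(42)                      # write access goes through the function call
print(x(), seen)           # prints: 42 [42]
```

Because every read and write goes through a call, the library gets a chance to notice changes and update whatever is bound to the value.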

There is an oddity, however; KnockoutJS doesn't call brython class functions for events properly (of course, it's expecting javascript). You can get around it, though, by wrapping the function with lambda.

Here's an example:

index.html:
 <!doctype html>  
 <html>  
   
 <head>  
   <script type="application/javascript" src="brython.js"></script>
   <script type="application/javascript" src="knockout-3.0.0.js"></script>
   <script type="text/python" src="exvg.py"></script>
 </head>  
   
 <body onload="brython()">  
   <table>  
     <thead>  
       <tr>  
         <td>x</td>  
         <td>y</td>  
       </tr>  
     </thead>  
     <tbody>  
       <tr>
       <td><input data-bind="value: x" /></td>  
       <td><input data-bind="value: y" /></td>  
       </tr>  
     </tbody>  
   </table>  
   <svg style="width:500px;height:200px;border:1px solid">  
     <circle r="10" data-bind="attr: {cx: x, cy: y}, event: { mousedown: mouseDown }"/>  
   </svg>  
 </body>  
   
 </html>  

exvg.py:
 class Point:  
   def __init__(self):  
     self.x = ko.observable(100)  
     self.y = ko.observable(50)  
       
     # This is a little funky:  
     self.mouseDown = lambda data, event : self._mouseDown(event)  
       
   def _mouseDown(self, event):  
     # I want to drag relative to the parent SVG element  
     containerRect = event.target.parent.getBoundingClientRect()  
     startX = containerRect.left  
     startY = containerRect.top  
       
     def mouseMove(event):  
       self.x(event.clientX-startX)  
       self.y(event.clientY-startY)  
       
     def mouseUp(event):  
       # In theory I should unbind the specific function, but brython  
       # chokes when I try to unbind mouseUp from mouseup  
       doc.unbind('mouseup')  
       doc.unbind('mousemove')  
       
     # I override other events until we're done dragging  
     event.preventDefault()  
     doc.bind('mousemove',mouseMove)  
     doc.bind('mouseup',mouseUp)  
   
 ko.applyBindings(Point())  

The result:

Wednesday, August 14, 2013

Toshiba Portégé M200 OS installation

The problem with the Toshiba Portégé M200 tablet is that it can only boot from a select few external CD drives. It can't boot from USB, so if you don't have one of these special CD drives, you're stuck messing with an SD card (the only SD card reader I own is a camera), or setting up a net install.

Or not.

For this guide, you will need a PATA/USB cable, another computer that can boot from a CD or flash drive, and at least one spare flash drive (you will need two if you boot Parted Magic from a flash drive).

I wasted a whole day trying to use Unetbootin to install the xubuntu .iso to a small partition (1GB), and then use that to install to the rest of the drive. I almost pulled it off, but the installer insisted on unmounting the installation partition. Here are a few notes, though, in case you try something similar:

  • I don't think the Portégé works with Unetbootin disks created in OS X (my guess is they need to be ext4 formatted, which OS X doesn't support)
  • The Portégé is non-PAE hardware; make sure whatever you install is compatible (in my case with xubuntu, use 12.04)

I eventually gave up and did something a little more complicated, but probably cleaner in the long run:

Boot another computer with a Parted Magic CD or flash drive, and plug in (1) the Portégé hard drive via PATA/USB and (2) a reasonably large spare flash drive (mine was 4GB). Using the Parted Magic Utilities:
  1. Format both the hard drive and spare flash drive to ext4 with the Partition Editor (gparted)
  2. Use Unetbootin to make the hard drive a bootable installation disk for your desired distribution
  3. Unmount the hard drive and spare flash drive, put the hard drive back in the Portégé, and plug in the flash drive
  4. Turn on the Portégé, and install the distribution to the flash drive (it might be a good idea to try the live version first to make sure the stylus drivers, etc. in your distribution work like you think they should... for the record, xubuntu works just fine!).
  5. When the Portégé reboots after installation, it will start up to the Unetbootin screen again (remember, it can't boot from USB)... just turn the machine off (you have to hold the power switch for a second).
  6. Remove the hard drive and flash drive, and plug them back into the computer running Parted Magic
  7. Using gparted, re-format the hard drive to ext4 (this may not be strictly necessary, but it's a good way to check which drive is sda, sdb, etc. before the next step)
  8. Using the Disk Cloning utility, do a local disk to local disk copy... copy the flash drive to the hard drive, and copy the bootloader when asked
  9. Open gparted again:
    1. Delete the swap and extended partition so you can expand the first (make a note of how big the extended partition was... I think the installer picks 510MB)
    2. Expand the first partition to fill the disk, leaving room to recreate the swap space at the end
    3. Recreate the extended partition and the swap space inside it
    4. Click Apply (you may need to create the swap space twice... it failed the first time I tried it)
  10. Finally, put the hard drive back in the Portégé, and start it up... if you did exactly what I did, you'll probably see some anomalies in the loading screen (mine was all fuzzy), but everything's fine once the desktop comes up.
  11. Open Synaptic Package Manager, click "Mark All Upgrades", and then click "Apply." At some point in the installation, it will ask about where to install grub - I installed on both sda and sda1.

Monday, June 10, 2013

epstopdf on OS X

I was having some issues with .eps files in TeXShop and the default installation of MacTeX on OS X Lion. Specifically, I was getting this error:

repstopdf: command not found

After much googling, I was able to figure out that I need to add the "--shell-escape" option to both fields in TeXShop -> Preferences -> Engine:

Tex:
pdftex --shell-escape --file-line-error --synctex=1

Latex:
pdflatex --shell-escape --file-line-error --synctex=1

Now I get this error:

epstopdf: command not found

At least this is supposed to be the name of the tool we need. After much more misleading googling, I decided to hack it myself. Here's my solution:

curl -O http://mirrors.ctan.org/support/epstopdf.zip
unzip epstopdf.zip
cd epstopdf
chmod a+x epstopdf.pl
mv epstopdf.pl epstopdf
sudo mv epstopdf /usr/texbin
cd ..
rm -rf epstopdf*

Hope that's useful for someone out there.

Friday, April 19, 2013

Workaround: How to export video from Tableau

Quick disclaimer: I'm running Tableau in VirtualBox on a mac. Much of this trick relies on the ability of OS X's Automator to convert pages of a PDF to multiple PNG files, and I also use a mac app called FrameByFrame. I'm sure there are Windows-only ways of doing this, but you'll need to do some additional googling.

If you drag something to the Pages shelf in Tableau, there are lots of options that would be useful in animation. Animation may or may not be appropriate for your tasks and data, but we'll assume for this post that it is. As useful as some of the tools are, though, each page is updated straight from the data, making for choppy playback that's difficult to tune. Worse, you can only export pages one at a time via Worksheet -> Export -> Image...

Here's the workaround to get a video you can control; it involves a couple external programs. As mentioned above, you'll need something that will convert the pages of a PDF to images, and you'll also need FrameByFrame (which is free). You'll probably also need a healthy amount of free hard drive space if this thing is going to be very long.

In Tableau, go to File -> Page Setup... and check "Show all pages"


Since we're going to sneak this out of Tableau by printing to PDF, keep in mind that your page size will determine the final shape and resolution of your video. You may want to tweak the Layout settings here as well.

Next, go to File -> Print to PDF... Again, you'll want to choose a page size that has the right proportions. Because it has to render every frame, this step will take a while after you click "Ok".

Once you've gotten your .pdf, create this workflow in Automator (please don't use .jpg, or you'll have an ugly mess!):



After running it on your .pdf, you'll have a lot of .png files named something like "results 012.png".

Now open FrameByFrame. Go to Edit -> Import Images... Select all the images you just generated. FrameByFrame gives you options for creating your animation - play with these settings until you're satisfied:


In my case, I actually want a low (3 fps) frame rate, but the awesome thing here is now you have control - the number of frames is data-driven in Tableau, and the actual playback speed can be tuned for perception. When you're done, go to File -> Export... to save your video. Here's one I generated, showing the last 30 days of earthquake data.

Friday, March 29, 2013

Efficiently running a SimpleHTTPServer from Eclipse

I've started working a lot more with straight javascript and HTML, and I've never been quite satisfied with the workflow in Eclipse. Sure, there are real-ish Apache or Tomcat servers you can set up with a lot of reading and patience. There's also a cute built-in Eclipse web browser that rarely renders anything the same as Chrome, Firefox, Safari, or anything that anyone actually uses... that, and you're just looking at files like you'd see them if, in Chrome, you went to File -> Open File...

Sometimes all you need is just a quick way to actually serve files and see how they behave in a real browser. There probably is a better way to do this in Eclipse, but I couldn't find any good documentation about how to pull it off (if you know how, please comment!).

Here's my workaround for OS X (it shouldn't be too hard to adapt to Linux or Windows):

Create a file like this in the project:


#!/bin/bash
# Quote the path so directories with spaces in their names still work
cd "$(dirname "${BASH_SOURCE[0]}")"
python -m SimpleHTTPServer 8123

Save it as "run.command" in the directory you want to serve files from.

Open a terminal, cd to the directory, and chmod a+x run.command

In Eclipse, right-click on the file, and go to "Open With -> System Editor"

From now on, every time you double click run.command, a server will start up (you'll get a terminal window that spits out python's logs). When you're done testing, you'll want to hit Control+C in this window to shut it down. To prevent lots of terminal windows from accumulating, you can go to "Terminal -> Preferences... -> Settings -> Shell" and under "When the shell exits", select "Close if the shell exited cleanly"

Point your favorite web browser at http://localhost:8123. You should see the files of that directory listed, and pages should display normally.

For an even more efficient workflow, make a system bookmark by dragging the icon next to the address to the desktop.

OS X will call it something like "Directory Listing for -.webloc". Rename it to "run.webloc" and add it to your project directory next to run.command.

Now to test a project, there are just two double clicks: first on run.command, then on run.webloc. When you're done testing, close the browser window, click the terminal, and hit Control+C.
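
If your python is Python 3, the SimpleHTTPServer module no longer exists under that name; it was folded into http.server. Here is a minimal sketch of the same idea as a small script instead of the one-liner, assuming Python 3.7 or newer. The function names are mine, and port 8123 is just carried over from above.

```python
import functools
import http.server

PORT = 8123  # same port as run.command above


def make_server(directory, port=PORT):
    """Create (but don't start) a server that serves files from `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to the loopback address only, so nothing is exposed to the network.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve(directory, port=PORT):
    """Serve `directory` until interrupted with Control+C."""
    server = make_server(directory, port)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

To use it the same way as run.command, save it next to the files you want to serve and add a last line along the lines of `serve(os.path.dirname(os.path.abspath(__file__)))` (with `import os` at the top).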