Audience Insights query engine: In-memory integer store for social analytics

November 19, 2014 § Leave a comment

I have recently published the details of the query engine that we have developed for Audience Insights product at Facebook:

Audience Insights (AI) is a tool designed to help Page owners learn more about the people on Facebook who matter most to them. This product visualizes anonymous and aggregate demographic, psychographic, geographic, and behavioral data (both Facebook native and third party) about a given set of people.

AI is powered by a query engine with a hybrid integer store that organizes data in memory and on flash disks so that a query can process terabytes of data in real time on a distributed set of machines. AI query engine is a fan-out distributed system with an aggregation tier and a leaf (data) tier. The aggregator sends a query request to all data nodes, which then execute the query and send back the local results to be aggregated.

Please see the original post for details:

Sweden Journey – Diary Entry II

July 22, 2013 § Leave a comment

6:43am, Sunday, September 11, 2011 – In Oslo

I am waiting for the cafe to open. It says 8am for Sundays. I am hungry. It is in the same building as the hostel.

I went to bed around 9pm last night and had a very good sleep until 4:30am. This hostel is not noisy at all. I do not understand why the door in the other hostel has to be that loudly…

I have changed my return travel because I have a meeting that I have to make and conflicts with the return train. The train ticket says internet connection during the ride. But it didn’t have, and the return ticket is with the same train –a very old one. I have booked a noon flight for Monday. Flight seems to take only one hour. Fortunately, it seems to be very easy to go to airport from central station, which is also close where I stay.

It is now raining crazy in Oslo, and seems to be rainy whole day and tomorrow! I wish I had brought some rain wear. I was planning to walk around. I think I need to change my plans, and figure out bus routes and schedules.

Phi-Cluster: Fault Tolerant Distributed Task Execution Engine

January 20, 2013 § Leave a comment

Sometime ago, I was looking at Akka Cluster for fault tolerant task execution in a cluster. Consider a service that performs certain operations, and it needs to scale. You would be designing something like a cluster of executors to perform your tasks. So, you have multiple machines. And put this on cloud. Now, you have bunch of virtual machines. Cloud is inherently less reliable, and have more machine, I mean, virtual machine failures. You need to deal with the tasks begin executed- or scheduled to be executed by a machine that has just failed. You probably do not want the tasks to be resubmitted by the client of the system.

Akka cluster takes the ideas from Dynamo paper and applied them to behavior of entities besides the data, which Dynamo tried to make highly available while scaling. However, Akka cluster is not ready, and fairly complex software system.

While I have had faith in Akka cluster, I wanted to experiment something simpler as Akka cluster is making its progress. I have developed a cluster framework, which I call Phi-Cluster. “Phi” comes from its failure detector, which I took the idea from Phi Accrual Failure Detector paper. I’ll talk about it later.

Phi-cluster is a naive effort to design a simple framework for a coordinated cluster of task executors that replicates task states and monitor each other to recover the tasks of failed nodes automatically to reduce the need for resubmitting and/or recreating those tasks. It uses a paxos based consensus service (Zookeeper), and accrual failure detector. It’s decentralized, no leader node.

The idea is quite simple. All tasks to be executed are represented in the consensus service, and the tasks assignments are replicated to a number of nodes. Each node registers heartbeat, and also runs a failure detector to monitor the nodes for which it has task replication (actually, not the task itself replicated, rather, its assignment). When the failure detector detects a failed node, it tries to reclaim the tasks and put them back into the available tasks queue of the distributed task pool. It also provides means to create and maintain task state in the consensus service. I know this may not be clear yet. I’ll try to clarify below. I also know this does not sound like something extremely scalable. Well, the main focus here is really the fault tolerance without creating something complex, and it is more for systems with long-running tasks.

Here is a more technical explanation of the system (from its documentation):

Phi-cluster has three main elements:
1. Paxos consensus based distributed task pool and state management.
2. Decentralized failure detector based on The Phi Accrual Failure Detector technique.
3. Simple, pluggable task definition, and task executor.

All or a subset of the nodes in phi-cluster run Zookeeper service that phi-cluster uses to keep task data/state, and node heartbeats for failure detector that is based on The Phi-Accrual Failure Detector where each node decides whether the nodes of interest failed based on a configured heartbeat-miss threshold. For this, each node registers heartbeat in the distributed state.

Phi-cluster admits tasks into the cluster and puts them into distributed task pool to make them available for cluster nodes to pick. Cluster nodes attempt to pick an available task in the task pool’s queue, and only one of them succeeds. Synchronization is provided through Zookeeper service, which is a majority based consensus system based on Paxos algorithm.

Part of taking a task from task pool, the node that is going execute the task designates a number of other nodes that would attempt to reclaim the task if the node failed before or during the execution of the task. This is a similar idea to replication but here task is not replicated, rather, its id is put into a secondary task queue of some other nodes.

phi-cluster - task replication

Each node monitors its secondary tasks queue, and uses accrual failure detector technique to make a failure decision for the nodes for which it has tasks in its secondary queue. If it reaches a failure decision for any such node, it goes over its secondary queue and tries to reclaim the tasks that are owned by the failed node. Note that it does not try to take the ownership of the tasks, rather, it tries to remove the existing ownership of the tasks to make them available in the distributed task pool so that they can be picked by any node of the cluster. This is why we call it “reclaiming” the tasks.

Whether the reclaimed tasks are re-executed or resumed on another node is up to the implementer of the tasks. Phi-cluster provides facility to keep task states across the cluster, allowing tasks to update its state machines so that they can be resumed from the last state. The extent a task can be reclaimed, re-executed or resumed is up to its implementation. Phi-cluster provides necessary primitives and runtime environment to make that possible.

Phi-cluster is a decentralized cluster, and there is no leader node. However, phi-cluster does not have split-brain issue since it uses a consensus service to keep the cluster state, and it is always the majority of nodes that will win in case of split-brain –nodes in the minority will not be able to function while the ones in the majority will consider them failed. Note that we mean phi-cluster does not explicitly have leader selection but Zookeeper that phi-cluster relies on does have leader node and leader selection.

Phi-cluster has an extensible task definition and execution mechanism. It comes with a simple executor and task definition interface based on Java’s executor service and Runnable interface. We also provide a Gearman based executor just to illustrate the ability to use a completely different executor, which runs outside of the JVM. However, we do not plan to focus on Gearman beyond proof of concept.

Phi-cluster focuses on fault-tolerance, and tries to provide it with a simple framework. Its use of cluster wide consensus service has implications on its scalability. We think Phi-cluster is more suited for applications with relatively long-running tasks. It is important to keep the state updates of tasks and inter-task dependencies at minimum.

Sweden Journey – Diary Entry I

December 15, 2012 § Leave a comment

I travelled to Sweden in 2011 September, and wrote some diary entries. I have decided to post them here to share the great time I had there.

10:10am, Saturday, September 10, 2011 ~ In train, heading to Oslo

After a hectic start with forgotten train tickets at home in Seattle, and the luggage that didn’t come to Stockholm and its unknown whereabouts, I now enjoy the trip in full extent. This is the third day.

Flen train station on way to Oslo train trip to Oslo train trip to Oslo

HP office is very nice. Ulrika helped me settle down. People whom I have met got first surprised when I told them why I am here, but soon they seemed to find it cool.

Stockholm looks like a very nice place. It seems to be a very orderly and clean city. Transportation has been super easy for me so far. The most I am impressed with is that anyone whom I talked to, including supermarket cashiers and bus drivers, knows English quite well, and that they are very friendly. Once, while I was looking around to figure out my way out from central station, a lady approached me and asked me whether I needed any help.

Weather is not bad at all. It is not cold. I was with polo t-shirt and since my luggage did not come, I did not have anything else for a day but I did not feel any cold. Its vegetation looks like that of Seattle, with somewhat smaller trees, though.

Hostel is nice. It is nicely decorated and very clean. Wi-fi connection is perfect. I feel a little old there, though. Most people are in their early twenties. The dorm room I spent first two nights had twelve people, with equal number of male and female occupants.

The downtown is pretty colorful. Friday night was very crowded as I was returning from office around midnight. The hostel’s location is very central in downtown. I haven’t seen Stockholm much yet. I’ll explore around after I return from Oslo.

This train to Oslo is a little bit disappointing. It is an old train, not a fast modern one that I was expecting. I was also expecting to have Internet connection because the ticket info indicated so. On the bright side, I enjoy scenery more. The scenery is similar to that of Seattle: green, lakes, and rural houses and farms, which are also very much like the ones around Seattle. The houses are also similar in terms of shape and structure.

6:30pm ~ In downtown Oslo

There are so many historical buildings. It seems to be denser than Stockholm, good for walking. Streets are crowded. A lot of youngsters. It’s more expensive than Stockholm. Much more. A regular Burger King meal costs around $20. Tea is $4 to $7. It doesn’t seem to be as clean as Stockholm. Jaywalking is common.

Oslo downtown Oslo downtown Oslo downtown

Running Hadoop Cluster on Ubuntu

March 2, 2012 § 4 Comments

Here is how I set up a tiny, little Hadoop cluster… with a master (NameNode) and three slave nodes, total four nodes. The master node has also JopTracker and Secondary NameNode. I just list here how to set this cluster up and get it up and running. I do not go into details of any components of Hadoop cluster.
« Read the rest of this entry »

Running RabbitMQ on Mac

February 23, 2012 § 1 Comment

I have recently need to run RabbitMQ server in my development environment. Apparently, setting up and running RabbitMQ server on Mac are pretty straightforward with MacPorts. It’s quite well documented on RabbitMQ web site here:
« Read the rest of this entry »

Setting Up Hadoop on MacOSX Lion (Single Node)

January 18, 2012 § 2 Comments

Brandon Werner’s blog entry provides step-by-step instructions. The only other thing that I needed to is to enable SSH with no password for localhost access.
« Read the rest of this entry »