@Author(value="Greg Gibeling, Lilia Gutnik, Nathan Burkhart", email="gdgib@berkeley.edu", website="http://www.eecs.berkeley.edu/~gdgib/cs294-1") @Project(value="RADTools", version="1.0.3", status=Research) @Copyright(start=2006, end=2007, holder="Regents of the University of California") @Revision(value="$Revision: 1.16 $", date="$Date: 2006/12/21 00:42:47 $", tag="$Name: $")

Package radtools

The main package of the RADTools project, contains all the primary documentation.

Class Summary
Main The main class of the RADTools project.
 

Package radtools Description

Editor's Note: We apologize for the state of the diagrams in non-IE browsers. We are still working through debugging SVG files in many of these browsers, and would appreciate any help our readers can offer.

1.0 Introduction

Distributed systems, and web based services in particular, are notoriously difficult to manage. It is widely accepted that the difficulty of managing a web-service increases super-linearly with the number of distinct component services. For a production service this is bad, but quite possibly survivable by the graces of dedicated and skilled management staff.

However for a research project this situation is quite untenable. First off, managing even a relatively small web-service needed to generate useful research data can require an unreasonable amount of expensive graduate student, professor and staff time. Second, and far worse, modern projects involving power consumption, statistical machine learning and other ideas founded on automated configuration changes are incredibly difficult if not impossible.

The key problem is that even those few applications which provide a good management interface rarely support automation or dynamic configuration changes, and even fewer provide any kind of uniform configuration and control interface.

This problem was incredibly evident during the labs for CS294-1 RADS, Fall 2006, wherein we, as graduate students with a simple assignment and detailed instructions, still had a fair amount of difficulty in sorting out basic management tasks for the first time.

In an effort to make management of these systems slightly more tenable, and more importantly to make research using them far easier, we have developed this project, dubbed RADTools.


2.0 RADServices

By setting up a uniform representation of the components of a distributed system, we can simplify their management significantly. Providing a uniform interface also goes a long way to help automate the process, as it significantly reduces the complexity of designing an automated manager system.

To this end, every service which can be managed by RADTools must be represented by an object which implements at minimum the RADService interface. This interface includes state management, structure and a uniform abstraction for expanding this interface, both statically and dynamically. In the sections below we describe the state and structure abstractions; however, the javadocs at the above links are far better references for any code based on this project.

2.1 State Management

The primary property provided by a RADService is "RADService.State". This enables some of the most important benefits of RADTools: namely failure management, including both causality and failure of the ability to manage services.

In the current implementation state management has been restricted to positive feedback only. That is to say, we never make assumptions about the state of a service, but instead rely on positive feedback to determine if a service is running, stopped, failed, etc. This is vital, as erroneous assumptions about service state which trigger corrective actions may in fact exacerbate the situation. In the future, when adding assumptions about service state (e.g. timeouts and other such tricks) the programmer must be careful to ensure that their assumptions are either acted on in such a way that the situation cannot be exacerbated, or ensure that they first enforce the assumption.

For example, if the liveness check for a service times out, the current implementation will mark the service state as unknown. If instead the service is to be reported as failed, the check, upon timeout, should first either crash or stop the service, thereby making the assumption of failure true. This will ensure that the pre-conditions for any corrective actions are properly met.
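The two timeout behaviours described above can be sketched in plain Java. This is a minimal illustration only: LivenessSketch, its State enum, check, checkEnforcing and the forceStop callback are hypothetical names, not part of the RADService interface.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; none of these names are part of RADTools. */
final class LivenessSketch {
    enum State { RUNNING, STOPPED, FAILED, UNKNOWN }

    /** Positive feedback only: a timeout or error tells us nothing, so report UNKNOWN. */
    static State check(Callable<State> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<State> result = executor.submit(probe);
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return State.UNKNOWN;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return State.UNKNOWN;
        } finally {
            executor.shutdownNow(); // interrupts a probe that is still hanging
        }
    }

    /** Enforcing variant: make the assumption of failure true before reporting it. */
    static State checkEnforcing(Callable<State> probe, long timeoutMillis, Runnable forceStop) {
        State state = check(probe, timeoutMillis);
        if (state == State.UNKNOWN) {
            forceStop.run(); // stop or crash the service first
            return State.FAILED;
        }
        return state;
    }

    public static void main(String[] args) {
        Callable<State> hung = () -> { Thread.sleep(5_000); return State.RUNNING; };
        System.out.println(check(hung, 100));
        System.out.println(checkEnforcing(hung, 100, () -> System.out.println("stopping service")));
    }
}
```

The enforcing variant only reports FAILED after the stop action has run, so any corrective action triggered by that state starts from a known condition.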

2.2 Structure

There are currently three structures over RADServices fully implemented, all of which are based on the RCF tree ADT. These are the composition, dependency and management trees, all of which are shown in the below diagrams.


Composition


Dependency


Management

The fourth structure is the communication graph (shown below), which is meant to capture the path of data in a distributed system, in particular requests or RPCs in a web-service. In most web-services a tree model would be appropriate, as all components are RPC based, however in more general distributed systems, this will often not be the case, and as such we prepared for a graph abstraction.


Communication

However the RCF graph ADT is not yet complete and until after the CS294-1 projects were over, we had little access to path based analysis tools meaning we had no good means or reason to capture communications paths. As such this structure is currently unimplemented, though at the time of this writing we are already beginning to remedy this.

The key use of these structures, in particular the fully implemented trees, is to quickly propagate service state changes and manage causality. In the dependency and management trees, failures propagate down as the children of a node are those which depend on it or are managed through it. In contrast in the composition tree failures propagate up, as larger services are built out of smaller ones. This accurately models the fact that, for example, the failure of a physical machine will result in the failure of a virtual machine and that the failure of a database could result in the failure of the entire web-service.
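As a rough illustration of the two propagation directions, the sketch below uses a tiny hand-written tree rather than the RCF tree ADT. Node, failDown and failUp are hypothetical names, and the sketch marks every affected node as failed unconditionally, which is stricter than the "could result in" wording above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of failure propagation; not the RCF tree ADT. */
final class PropagationSketch {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node parent;
        boolean failed;

        Node(String name) { this.name = name; }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /** Dependency and management trees: a failure reaches every descendant. */
    static void failDown(Node node) {
        node.failed = true;
        for (Node child : node.children) {
            failDown(child);
        }
    }

    /** Composition tree: a failed part marks every enclosing service as failed. */
    static void failUp(Node node) {
        for (Node n = node; n != null; n = n.parent) {
            n.failed = true;
        }
    }

    public static void main(String[] args) {
        Node machine = new Node("physical machine");
        Node vm = new Node("virtual machine");
        machine.add(vm);
        failDown(machine);
        System.out.println(vm.name + " failed: " + vm.failed);

        Node website = new Node("web service");
        Node database = new Node("database");
        website.add(database);
        failUp(database);
        System.out.println(website.name + " failed: " + website.failed);
    }
}
```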

State propagation in the communication graph is slightly more complicated. If it can be reduced to an RPC communication tree, clearly failures propagate up, and this is another kind of dependency tree. However as a general graph the failures must be propagated in the direction of the data flow. Combined with some vertex local information this could generate full failure causality information, something clearly missing from existing distributed systems tools. A simple example of failure propagation is shown below.


Failure
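Propagation along the direction of data flow in a general graph amounts to a reachability search. The sketch below assumes the graph is given as a plain adjacency map; FlowFailureSketch and affected are hypothetical names, and since the communication structure is unimplemented this is not RADTools code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: which vertices are affected when one vertex fails. */
final class FlowFailureSketch {
    /** Returns the failed vertex plus everything reachable along data-flow edges. */
    static Set<String> affected(Map<String, List<String>> flowsTo, String failed) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(failed);
        while (!pending.isEmpty()) {
            String vertex = pending.poll();
            if (seen.add(vertex)) { // visit each vertex once, so cycles terminate
                pending.addAll(flowsTo.getOrDefault(vertex, List.of()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> flowsTo = Map.of(
                "load balancer", List.of("web server"),
                "web server", List.of("database"),
                "database", List.of("web server"));
        System.out.println(affected(flowsTo, "web server"));
    }
}
```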

The biggest benefit of these structures to the casual user of RADTools is their display in the main window, and the resulting ability to start all the component services of a web-service in a single click, not to mention the visualization of failures.

2.3 Events & Continuations

Events are widely used in RADTools to model causality, and thereby implement policy. For example, most suggested policies for power savings in a web-service datacenter are based on starting and stopping servers based on the current service load. Shown below is a possible diagram of the event sources and sinks which could implement a policy like this.


Event Chaining

Of course more general examples can be manufactured, but RADTools aims to provide the event framework, rather than implement any specific policy.
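To make the idea of chained sources and sinks concrete, here is a minimal sketch of a load-driven power policy. Source, LoadPolicy, Command and the thresholds are all hypothetical; the RCF events package has its own interfaces, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of event chaining; not the RCF events package. */
final class EventChainSketch {
    /** A source fans each event out to all of its registered sinks. */
    static final class Source<E> {
        private final List<Consumer<E>> sinks = new ArrayList<>();

        void addSink(Consumer<E> sink) { sinks.add(sink); }

        void fire(E event) {
            for (Consumer<E> sink : sinks) {
                sink.accept(event);
            }
        }
    }

    enum Command { START_SERVER, STOP_SERVER }

    /** A sink for load samples that is itself a source of commands. */
    static final class LoadPolicy implements Consumer<Double> {
        final Source<Command> commands = new Source<>();
        private final double low;
        private final double high;

        LoadPolicy(double low, double high) {
            this.low = low;
            this.high = high;
        }

        @Override
        public void accept(Double load) {
            if (load > high) {
                commands.fire(Command.START_SERVER);
            } else if (load < low) {
                commands.fire(Command.STOP_SERVER);
            }
        }
    }

    public static void main(String[] args) {
        Source<Double> loadSamples = new Source<>();
        LoadPolicy policy = new LoadPolicy(0.2, 0.8);
        loadSamples.addSink(policy);
        policy.commands.addSink(command -> System.out.println("command: " + command));
        loadSamples.fire(0.9);
        loadSamples.fire(0.5);
        loadSamples.fire(0.1);
    }
}
```

Because the policy is just another sink, it can be swapped for a different one without touching the load source or the code that starts and stops servers.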


3.0 Implementation

Our current implementation is geared heavily towards the services and experimental setup which was used in the labs for CS294-1 RADS, Fall2006. This was driven both by the availability of this setup to test against, and our desire to ease future work on the other projects from that class.

Our choice of the Java language was driven by the availability of the JSCH and RCF libraries, in addition to the cross-platform compatibility and high level of abstraction provided by Java. In contrast to a collection of shell scripts, this means that RADTools provides a far more useful (and robust) abstraction. In contrast to other mainstream programming languages, this gives us access to a richer set of libraries.

3.1 Current Services

We have currently created implementations of the RADService interface for LigHTTPD, HAProxy, Ruby on Rails, MySQL and Memcached. There are also RADServices for VMware, Linux and Fedora Core. In fact the MySQL and Memcached services are currently restricted to Fedora, primarily because that is what we had to test with on the Millennium cluster at Berkeley.

The Linux system RADService includes support for querying Nagios, if it is running, and reporting the useful Nagios statistics as properties of the Linux system: "Nagios.CurrentLoad", "Nagios.NumUsers", "Nagios.NumProcs", "Nagios.PercentDiskFree", "Nagios.PercentMemUsed". This is a primitive form of service discovery but is a good example of how it might be accomplished.

RADTools is the base service, and represents the RADTools application in the various structures (RADService.management() in particular). It includes the code to generate the main window, including tree views of the service structures. Furthermore, as the current implementation of RADTools is focused entirely on the management of web-services, the RADTools object includes references to the datacenter, upon which all physical machines are assumed to depend, and the website or web service, the ultimate composite service which RADTools is meant to manage.

Finally RADTools includes a queue, which is used to provide scheduling of long running tasks. In addition to allowing a more controlled model of execution, this central queue of tasks is shown in the GUI providing positive feedback to the user. A current deficiency of the skiplist used to keep tasks in the order they should be run (timer tasks may specify a date after which they are to run) means that duplicates are not currently eliminated, however this will be fixed very shortly.
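One way a time-ordered task queue could reject duplicates is to keep tasks in a sorted set whose ordering also defines equality. The sketch below uses the JDK's ConcurrentSkipListSet in place of the RCF Skiplist; Task, schedule and pollDue are hypothetical names, and treating "same name and same run time" as a duplicate is an assumption made for illustration.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical sketch of a time-ordered task queue without duplicates. */
final class TaskQueueSketch {
    static final class Task {
        final String name;
        final long runAfterMillis;

        Task(String name, long runAfterMillis) {
            this.name = name;
            this.runAfterMillis = runAfterMillis;
        }
    }

    // Ordered by earliest run time, then by name; tasks that compare equal are duplicates.
    private final ConcurrentSkipListSet<Task> tasks = new ConcurrentSkipListSet<>(
            Comparator.comparingLong((Task t) -> t.runAfterMillis).thenComparing(t -> t.name));

    /** Returns false when an equal task is already queued. */
    boolean schedule(Task task) {
        return tasks.add(task);
    }

    /** Removes and returns the earliest task that is due, or null; assumes a single consumer. */
    Task pollDue(long nowMillis) {
        Task head = tasks.isEmpty() ? null : tasks.first();
        if (head == null || head.runAfterMillis > nowMillis) {
            return null;
        }
        tasks.remove(head);
        return head;
    }

    public static void main(String[] args) {
        TaskQueueSketch queue = new TaskQueueSketch();
        System.out.println(queue.schedule(new Task("restart haproxy", 100)));
        System.out.println(queue.schedule(new Task("restart haproxy", 100)));
        System.out.println(queue.pollDue(200).name);
    }
}
```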

3.2 JSCH Library

The JSCH library is a fairly simple implementation of the SSH protocol in java. This was a clear requirement for managing remote linux based webservices, and in fact one of the original frustrations that sparked RADTools was the need to keep around 7-10 SSH sessions open at a time to manage even a relatively simple web service.

We downloaded the JSCH library, and its attached compression library, from JCraft. While the library itself contains no major documentation, the examples were enough to jumpstart our development, despite their quirks.

The largest, and really only, drawback to our use of this library is its design: JSCH includes multi-threaded code without clear documentation of why or when thread safety may be an issue.

3.3 RCF

RCF is a set of library code developed by Greg Gibeling, originally for the RDL Compiler, a part of the RAMP project. However a large part of its design and development has been motivated by this project, which as we discuss in our future work section, is actually a very good thing. There are three key pieces to the RCF libraries which are part of this project: data structures, events and components.

The transactional data structures are the basis of nearly all of the RADTools code, and provide some vital functionality: the ability of any implementation of Collection to generate an event in response to a mutation. This is what allows us to write the below code, which configures HAProxyLinux to add a new proxy pool, and a new server to that pool.

HAProxyLinux.HAProxyPool pool = new HAProxyLinux.HAProxyPool(new HostPort.Default("0.0.0.0", 10000));
proxy.pools.add(pool, "apool");
pool.servers.add(new HAProxyLinux.HAProxyServer(new HostPort.Default("localhost", 25), 22, 3000, 1, 2), "aserver");

The above code is concise, easy to understand and similar to what would appear in the HAProxy configuration file, thereby making it easy to learn for those familiar with HAProxy, and easy to automate even for those who are not. However what really makes that three line code snippet interesting is that, because of the transactional data structures, it will actually cause a new HAProxy configuration file to be generated, uploaded over SSH to the server, and HAProxy to be gracefully restarted to use the new configuration.
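The mechanism behind this, a collection that raises an event on every mutation, can be sketched with the standard library alone. ObservableListSketch and onChange are hypothetical names; the sketch has no transactions and is far simpler than the RCF data structures, and the listener here only prints where RADTools would regenerate and upload a configuration.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of a list that notifies listeners after each mutation. */
final class ObservableListSketch<E> extends AbstractList<E> {
    private final List<E> items = new ArrayList<>();
    private final List<Consumer<List<E>>> listeners = new ArrayList<>();

    void onChange(Consumer<List<E>> listener) { listeners.add(listener); }

    @Override public E get(int index) { return items.get(index); }

    @Override public int size() { return items.size(); }

    @Override public void add(int index, E element) {
        items.add(index, element);
        changed();
    }

    @Override public E set(int index, E element) {
        E previous = items.set(index, element);
        changed();
        return previous;
    }

    @Override public E remove(int index) {
        E removed = items.remove(index);
        changed();
        return removed;
    }

    /** Hands every listener a read-only snapshot of the new contents. */
    private void changed() {
        List<E> snapshot = Collections.unmodifiableList(new ArrayList<>(items));
        for (Consumer<List<E>> listener : listeners) {
            listener.accept(snapshot);
        }
    }

    public static void main(String[] args) {
        ObservableListSketch<String> servers = new ObservableListSketch<>();
        servers.onChange(current ->
                System.out.println("would regenerate configuration for " + current));
        servers.add("aserver");
        servers.add("bserver");
        servers.remove(0);
    }
}
```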

The second main component of RCF used in RADTools is the event model. As noted above, the transactional data structures rely on the events package to provide a set of standard interfaces for sourcing, syndicating and sinking events. We omit further discussion as it would merely duplicate the Events & Continuations section above. The third and final main component of RCF used by RADTools is the component framework. While RADTools relies extensively on this, it was also the catalyst for its final development.

The component framework provides an abstraction of reflection with extensions for the dynamic addition of operations (methods) and properties (fields) on components (objects). The ability to dynamically add properties (fields) to a component is the basis of our integration with nagios, as seen in LinuxSystem.
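A dynamically added property can be pictured as a named getter registered at run time. The sketch below is an assumption-laden illustration: DynamicComponentSketch, addProperty, get and names are hypothetical, and the value supplied for the Nagios property is a placeholder rather than a real Nagios query.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch of run-time properties; not the RCF component framework. */
final class DynamicComponentSketch {
    private final Map<String, Supplier<?>> properties = new LinkedHashMap<>();

    /** Registers (or replaces) a property whose value is computed on each read. */
    void addProperty(String name, Supplier<?> getter) {
        properties.put(name, getter);
    }

    Object get(String name) {
        Supplier<?> getter = properties.get(name);
        if (getter == null) {
            throw new IllegalArgumentException("no such property: " + name);
        }
        return getter.get();
    }

    Set<String> names() {
        return properties.keySet();
    }

    public static void main(String[] args) {
        DynamicComponentSketch linuxSystem = new DynamicComponentSketch();
        // Placeholder value; a real implementation would query the monitoring system.
        linuxSystem.addProperty("Nagios.CurrentLoad", () -> 0.42);
        for (String name : linuxSystem.names()) {
            System.out.println(name + " = " + linuxSystem.get(name));
        }
    }
}
```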

Furthermore the component framework includes support for generating property change events in response to property changes. This allows the GUI to be kept in sync with the properties, and the configurations to be kept in sync with the GUI. Please see the GUI package and AbstractDynamicProperty.gui(rcf.core.framework.component.DynamicBound.GUIType) for details about the automatic GUI generation, and property synchronization code.
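The standard java.beans package offers the same pattern of property change events, which may help readers unfamiliar with RCF. The sketch below uses PropertyChangeSupport from the JDK; BoundPropertySketch and its port property are hypothetical and unrelated to AbstractDynamicProperty.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

/** Hypothetical sketch of a bound property using the JDK's java.beans support. */
final class BoundPropertySketch {
    private final PropertyChangeSupport changes = new PropertyChangeSupport(this);
    private int port = 80;

    void addListener(PropertyChangeListener listener) {
        changes.addPropertyChangeListener(listener);
    }

    int getPort() { return port; }

    void setPort(int newPort) {
        int oldPort = port;
        port = newPort;
        // No event is delivered when the old and new values are equal.
        changes.firePropertyChange("port", oldPort, newPort);
    }

    public static void main(String[] args) {
        BoundPropertySketch service = new BoundPropertySketch();
        service.addListener(event -> System.out.println(
                event.getPropertyName() + ": " + event.getOldValue() + " -> " + event.getNewValue()));
        service.setPort(8080);
    }
}
```

A GUI widget and a configuration writer can both register as listeners, which is one way to keep the two in sync from a single source of truth.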


4.0 Concerns & Obstacles

RADTools has been designed to fill two different, but similar roles, first to allow a person to more easily manage a distributed system, in particular a web service, and second to allow the automation of that management, specifically for research purposes.

The second role is easiest to imagine in the case where an automated, (perhaps SML based) management system is written in java and linked against RADTools. In this section we strive to document some of our development difficulties in the hope that projects seeking to integrate with RADTools this way will be able to avoid the pain that we suffered.

4.1 JDT Bug

For most any language, and certainly for any large code project an IDE is an indispensable tool, and Eclipse for java is one of the best. However at the time of this writing there is an open Eclipse JDT bug which is relatively uninteresting, except that any user of this code, particularly the RCF libraries, must sometimes work around it.

This bug causes incremental compilation of the RCF libraries to fail after edits of some files, particularly those involving the rcf.core.util.map package or the rcf.core.util.collection.Skiplist class. The result will be that random compiler errors will appear in possibly only vaguely related files (including this one, if there is even a link to the skiplist file), often with an error appearing on the first line of the file (always a comment in this project). The solution is to perform a clean build using the Project->Clean menu to fully rebuild the project.

The bug has already been fixed in the next versions of Eclipse (we submitted the bug report a month or so ago) which will be released in the next month or two.

4.2 Javadoc Bug

There is a significantly more problematic bug in the Javadoc tool, produced by Sun. This makes the javadocs for the rcf.core.util package impossible to generate.

The problem is in the ability of javadoc (and perhaps javac as well), to trace the class hierarchy of certain inner classes, causing it to emit spurious errors and warnings and finally to throw an exception and terminate. We have yet to fully isolate this bug, despite quite some time trying, and therefore have simply omitted that documentation from this website. We hope to find a workaround, or a solution soon.

This problem is unfortunate, as javadocs are quite possibly one of the best code documentation tools in widespread use, however our code in question is quite complicated, and uses complex features added in Java 1.5, so we fully expect that many of the bugs will disappear when Java 1.6 goes to full production release.

4.3 Thread Safety

Both JSCH and Swing use java threads, without the consent or intervention of the client programmer. The fact that java threads are ubiquitous, cross platform and standardized is wonderful as this makes writing multi-threaded code easy. However both Swing and JSCH sometimes lack appropriate documentation to describe the thread requirements of using them.

As a result of this missing documentation we spent a fair amount of time debugging threading problems, only two of which could be traced to our own code or to our lack of understanding of these libraries. Please note that for Swing the threading reference is the Concurrency in Swing article.

Given how powerful both JSCH and Swing are, we find that even with these problems, using them allowed us to produce a significantly better project in a much shorter time. However the clear lesson here is that any library which introduces threads to a program must document how it does so, why it does so and what restrictions the library imposes on the user to enforce thread safety. The one escape clause in this requirement, which we must invoke in places for this project, is that such documentation may only be missing if the threading is provided by a base library which is missing this documentation itself.

An unfortunate consequence of this is that anyone using RADTools as a codebase may currently encounter some concurrency bugs. We have not, and we will be more than happy to debug them should they arise, but this is a possible issue.

In general, RADTools follows a simple Swing threading model: long running tasks should be scheduled through RADTools.schedule(rcf.core.concurrent.schedule.TimerTask), and GUI operations should be scheduled using SwingUtilities.
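The general shape of this model, slow work on a worker thread and the GUI update back on the event dispatch thread, can be sketched with the JDK alone. ThreadingSketch and runThenShow are hypothetical names, and a plain ExecutorService stands in for RADTools.schedule; only SwingUtilities.invokeLater is a real API here.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;
import javax.swing.SwingUtilities;

/** Hypothetical sketch of the worker-then-GUI hand-off; not RADTools code. */
final class ThreadingSketch {
    /** Runs longTask on the worker, then hands its result to show on the GUI executor. */
    static <T> void runThenShow(Executor worker, Executor gui,
                                Supplier<T> longTask, Consumer<T> show) {
        worker.execute(() -> {
            T result = longTask.get();              // slow work stays off the GUI thread
            gui.execute(() -> show.accept(result)); // only the display step touches the GUI
        });
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        runThenShow(worker, SwingUtilities::invokeLater,
                () -> "benchmark finished",
                text -> System.out.println(text + ", on event dispatch thread: "
                        + SwingUtilities.isEventDispatchThread()));
        worker.shutdown(); // already submitted work still runs
    }
}
```

Passing the GUI side in as an Executor keeps the hand-off testable without a display, since a direct executor can be substituted for SwingUtilities::invokeLater.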

As a final note, the RCF library provides no thread safety or synchronization, with the exception of the GUI service which will maintain thread safety between a worker thread and the Swing event dispatcher. Any users may also wish to investigate the AdapterHelpers.cast(Object, Class, rcf.core.concurrent.schedule.Runner, rcf.core.util.groups.ImmutableTriple[], rcf.core.util.adapter.TypeAdapter[]) method which can be used to add synchronization to nearly any object or method.
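For readers without access to the RCF sources, the general idea of adding synchronization to an existing object can be illustrated with a JDK dynamic proxy. This is not how AdapterHelpers.cast is implemented, which we do not reproduce here; SynchronizedProxySketch, synchronizedView, Counter and PlainCounter are hypothetical names.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

/** Hypothetical sketch: wrap any interface implementation so calls are serialized. */
final class SynchronizedProxySketch {
    interface Counter {
        void increment();
        int value();
    }

    /** A deliberately unsynchronized implementation. */
    static final class PlainCounter implements Counter {
        private int count;
        @Override public void increment() { count++; }
        @Override public int value() { return count; }
    }

    /** Returns a view of target in which every method call holds one shared lock. */
    static <T> T synchronizedView(Class<T> iface, T target) {
        Object lock = new Object();
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] { iface },
                (self, method, args) -> {
                    synchronized (lock) {
                        try {
                            return method.invoke(target, args);
                        } catch (InvocationTargetException e) {
                            throw e.getCause(); // rethrow the target's own exception
                        }
                    }
                });
        return iface.cast(proxy);
    }

    public static void main(String[] args) throws InterruptedException {
        Counter counter = synchronizedView(Counter.class, new PlainCounter());
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    counter.increment();
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println(counter.value());
    }
}
```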


5.0 Conclusion

At the end of this project we are now able, in 5 minutes, to configure a complete Ruby on Rails web-application, launch all of the requisite services, benchmark it and have graphs automatically generated (see AdvancedResearchIndexLoadLinux and ARIL). Of course a real benchmark takes slightly longer than 2 minutes to run, but the point is that the setup for this style of benchmark took us two weeks toward the beginning of the semester (during Lab3 for example), without RADTools.

This is a clear win: we can now easily do research that was difficult and unreliable in the past.

In addition to making life easier, this means that more complex, and realistic web services can now be researched, and that projects can, more easily, experiment with a variety of system configurations without learning all of the text file formats, and dealing with logins on 13 different machines, as was literally the case during CS294-1 RADS, Fall 2006.

Aside from the short term benefits described in this section, we believe that RADTools opens up opportunities previously closed because of their difficulty, which we discuss below.


6.0 Future Work

6.1 Library Development

A big part of this project was developing the pieces of the RCF library which were needed to implement this project. While the event model and component framework were both well planned out and partially complete, there was a fair amount of work to finish them off.

At the end of this project it has turned out that the code based on these libraries is significantly easier to both write and understand. Furthermore, without them the event and continuation programming of website policy required for SML would be impossible.

However as with any library there is still work to be done, everything from better concurrency support in the rcf.core.concurrent.primitives package to simplify the problems outlined in section 4.3, to a more complete AutoGUI in the gui package, to a minor rewrite of the rcf.core.util.collection.Skiplist datastructure to allow elimination of duplicate tasks in the RADTools task queue.

6.2 Distributed Implementation

RADTools was designed to provide centralized management of a distributed service, specifically because of the RADLab goal of allowing a single person to design, assess, deploy and operate a large scale web service, wherein the single person clearly implies a natural point of centralization.

However going forward with this project it's clear to us that managing a large number of machines from a single point will result in a fairly large load. Currently the management traffic is restricted to simple state updates and occasional configuration uploads; however in the future, access to logs and a larger set of continuous performance data suggests that management of a distributed system must itself be managed and distributed.

Because of the way the RCF event model and component framework have been designed, it would be a simple matter to extend them to include RMI (Remote Method Invocation), as in JMX, upon which the component model is loosely based. This should enable two major features: first and foremost it would easily allow distribution of the management system without breaking the abstraction in any way, and second it would allow non-java code easy access to the management system, by tapping into the RMI mechanism.

6.3 Service Discovery

Currently the structure (the machines in use, and the services they run, but not the configuration of those services) of the system to be managed by RADTools must be hardcoded, for now in Main.inner(). Given the separate class compilation model of java this is not an onerous requirement, and yet it would clearly be nice to simplify the process of describing a new system, as this is a painful task and must be completed before RADTools can be used to manage a system.

Obviously adding a simple system description language would go a long way to decreasing the perceived cost of describing a new system, even if it does not make any real difference, since the java is quite concise and self-documenting. However far more interesting would be integration with some automatic service discovery system.

There are currently two usage models in mind for RADTools, first, the management of a pre-existing system and second, setting up a new system. Given that RADTools includes the vast majority of the configuration options for the various RADServices it supports, the second model is clearly both preferable and tenable, as the initial setup of a distributed web service is often the most painful part.

However in both cases, there is information a user should not have to enter. Clearly some things, like the DNS name or IP address of at least one server involved, must be entered. However information like which component services each server has installed, or can run could be discovered by simple inspection of installed programs.

Furthermore, path based analysis could be used to discover relationships between component services, which could then be reflected by the communication structure. At the time of this writing we are already beginning to work with another group from CS294-1 to do just this.

6.4 SML & Plugins

One of the biggest goals of CS294-1 RADS, Fall 2006 was to bring together Statistical Machine Learning (SML) and Systems graduate students, in the hope of creating hybrid projects. In that spirit one of the main goals of RADTools is to allow a researcher in Statistical Machine Learning, with some Java skill, but no detailed knowledge of web service administration to construct just such a hybrid management system. Goals in this area range from diagnosing problems, and even fixing them, to power and CO2 conservation.

Our contribution with this project is an abstraction and codebase which we hope will remove from future classes and research, the drudgery we felt working with Ruby on Rails administration during the class labs. Given the responses of some of our fellow students, we feel we've already gone a long way towards this goal, but time and further projects will tell.

Contributing to this research in a very real way was a major influence on the design of RADTools, primarily in the decision to use the RCF library in order to simplify further coding. For example we use the RCF event model to capture RADService state changes, which are propagated by service state proxies through the various RADService structures (most notably management). This event model was specifically designed to be generalizable to any kind of event, including periodic performance data gathering, from "Nagios.CurrentLoad" to radtools.services.researchindex_load. Specifically we have planned that any SML or other "policy" manager should be designed as a series of event sinks which implement DSP or SML algorithms over time series data to produce service control calls, e.g. to set a RADService.radServiceState(), as shown below.


Framework
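As one possible reading of this plan, the sketch below shows an event sink that smooths a time series with a moving average and issues a control call only when its decision changes. SmoothingPolicySketch, its window size, threshold and the Boolean "should be running" callback are all hypothetical; a real manager would call into a RADService instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;
import java.util.function.DoubleConsumer;

/** Hypothetical sketch: a moving-average sink that drives a control callback. */
final class SmoothingPolicySketch implements DoubleConsumer {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;
    private final double threshold;
    private final Consumer<Boolean> control; // true means "an extra server should run"
    private double sum;
    private Boolean lastDecision;

    SmoothingPolicySketch(int size, double threshold, Consumer<Boolean> control) {
        this.size = size;
        this.threshold = threshold;
        this.control = control;
    }

    @Override
    public void accept(double sample) {
        window.addLast(sample);
        sum += sample;
        if (window.size() > size) {
            sum -= window.removeFirst(); // slide the window forward
        }
        if (window.size() < size) {
            return; // not enough data for a decision yet
        }
        boolean decision = sum / size > threshold;
        if (lastDecision == null || lastDecision != decision) {
            lastDecision = decision;
            control.accept(decision); // control call only on a change of decision
        }
    }

    public static void main(String[] args) {
        SmoothingPolicySketch policy = new SmoothingPolicySketch(3, 0.5,
                run -> System.out.println("extra server should run: " + run));
        for (double load : new double[] { 0.9, 0.9, 0.9, 0.1, 0.1 }) {
            policy.accept(load);
        }
    }
}
```

Smoothing before acting keeps a single noisy sample from starting or stopping a server, which matters because each control call is expensive.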

6.5 Beyond Web Services

The actual RADTools implementation work throughout this semester has been very clearly focused on the management of web services, in particular, the stack described in section 3.1 Current Services. However the overall design has remained slightly more general than needed for the class project.

As a result we have already begun investigating how RADTools could be used to manage other distributed systems. In particular, because RADTools relies on the RCF library, which is a key part of the RDL Compiler v3 (RDLC3, see the RAMP website for more information), we believe that it will be both easy and very fruitful to adapt RADTools to manage a running RDL host or target system. In particular, RDL provides support for cross-platform system design and emulation, which implies that there are a number of heterogeneous platforms which must all be running components of the same system at once, and working in concert, exactly the scenario RADTools is meant to handle.


Appendix A: The Presentation

The final Presentation for CS294-1 RADS, Fall 2006, given on Wednesday December 13th, at the U.C. Berkeley RADLab.


Appendix B: The Structure of a Website

This appendix documents the configuration of the web service as coded in Main.inner().


Appendix C: Acknowledgements

Much of the library code for this project was co-developed for RADTools and the RDL Compiler v3, a part of the RAMP Project. As such we would like to thank the Berkeley Wireless Research Center and the Gigascale Systems Research Center. We would also like to thank Professors Dave Patterson and Armando Fox for guidance and teaching CS294-1 RADS. We would also like to thank Peter Bodik and James Zhang for their work on CS294-1, including the initial webservice setup. And finally we would like to thank all the students who struggled through CS294-1; the pain of the class labs made the necessity of this project clear to us.


TODO: Javadoc review, Seal/Finalize
TODO: Fix annotations (project name & website, add lilia and nathan to authors, license, etc...)

Author:
Greg Gibeling