CaArray Extension by Dr. Matt Aldridge

Background

Dr. Matt Aldridge at the Cambridge Research Institute has developed an ad hoc extension to the caArray 2.2.x application so that data can be bulk-loaded into caArray through the caArray API instead of its web interface. Matt has agreed to share the code with the community once the caArray development team has run a sanity check on it. At the MATKC, we would like to confirm his steps for re-packaging caArray and verify that his extension works.

Oleg's Task and Progress

Oleg's Task

We would greatly appreciate it if Oleg could take a look at Matt's code, follow Matt's instructions for re-packaging the code to enable the extension, and then test the extension with a sample dataset. We will provide the sample dataset to Oleg.

Oleg's specific tasks (added on 10/6/2009):

On caArray 2.2.0

  1. Free up disk space on your desktop. Make sure that the MySQL database is up.
  2. Download UPT 3.2.0 at https://gforge.nci.nih.gov/frs/download.php/6516/upt_gui_distribution_3.2.0-200904291406.jar
  3. Follow UPT installation guide at https://gforge.nci.nih.gov/frs/download.php/7022/upt_3_2_2_lsd_1_1_0_installation_guide.pdf to install UPT (You can use the graphical installer)
  4. Download caArray 2.2.0 at https://gforge.nci.nih.gov/frs/download.php/6518/caarray_gui_distribution_2_2_0.jar
  5. Follow caArray installation guide at https://gforge.nci.nih.gov/frs/download.php/6703/caarray_2_2_1_installation_guide.pdf to install caArray (You can use the graphical installer). Make sure that a different database is used if you have installed caArray before.
  6. Follow https://cabig-kc.nci.nih.gov/Molecular/KC/index.php/Caarray003 to create users for caArray 2.2.0 in UPT
  7. Log into the caArray instance, create an experiment following https://gforge.nci.nih.gov/frs/download.php/5774/caArray_2_2_users_guide.pdf, and upload a dataset (ask Fan or Zhong for help)
  8. Shut down the caArray 2.2.0 instance and the UPT instance
  9. Follow Matt's instructions to insert his code into caArray
  10. Restart UPT and then caArray instance
  11. Log into caArray and create another experiment and load another dataset using the web interface
  12. Develop a client application to access Matt's code to add an experiment and dataset into caArray from the command line (work with Zhong)
  13. Document success/failure and any issues for steps 7, 11, and 12.

On caArray 2.2.1

  1. Shut down caArray 2.2.0
  2. Download caArray 2.2.1 at https://gforge.nci.nih.gov/frs/download.php/6668/caarray_gui_distribution_2_2_1.jar
  3. Install caArray 2.2.1 following https://gforge.nci.nih.gov/frs/download.php/6678/caarray_2_2_1_installation_guide.pdf. Make sure that a different database is used, since caArray 2.2.0 is already installed
  4. Follow https://cabig-kc.nci.nih.gov/Molecular/KC/index.php/Caarray003 to create users for caArray 2.2.1 in UPT
  5. Log into the caArray 2.2.1 instance, create an experiment following https://gforge.nci.nih.gov/frs/download.php/5774/caArray_2_2_users_guide.pdf, and upload a dataset (ask Fan or Zhong for help)
  6. Shut down the caArray 2.2.1 instance and the UPT instance
  7. Follow Matt's instructions to insert his code into caArray 2.2.1
  8. Restart UPT and then caArray instance
  9. Log into caArray and create another experiment and load another dataset using the web interface
  10. Develop a client application to access Matt's code to add an experiment and dataset into caArray from the command line (work with Zhong)
  11. Document success/failure and any issues for steps 5, 9, and 10.

Oleg's Progress (Updated by Oleg)

Progress Table

Matt's Code

Caarray-cri.zip

Email Threads as Additional Background (Please do not circulate)

7/28/2009

Basically we have extended the caArray with an additional stateless session EJB that is remotely accessible and exposes some methods that are only available on local interfaces, i.e. only through the web functionality. The method implementations are mostly pretty simple since they just delegate to existing service EJBs through these local interfaces, although this would be somewhat daunting for a non-EJB Java developer.
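To illustrate the delegation pattern described above, here is a minimal plain-Java sketch. It is not Matt's code: LocalProjectService, RemoteFacade, RemoteFacadeBean and the createProject method are hypothetical names invented for this example, and the EJB annotations that a real deployment would presumably use (@Stateless, @Remote, @EJB) appear only in comments so that the sketch compiles without an application server.

```java
/**
 * Hypothetical stand-in for an existing service that is only reachable
 * through a local interface (in an EJB 3 deployment: @Local).
 */
interface LocalProjectService {
    long createProject(String title);
}

/**
 * Hypothetical remotely accessible interface (in an EJB 3 deployment: @Remote).
 * It re-exposes an operation that clients otherwise cannot reach.
 */
interface RemoteFacade {
    long createProject(String title);
}

/**
 * Hypothetical facade implementation (in an EJB 3 deployment: @Stateless,
 * with the local service injected via @EJB instead of the constructor).
 */
final class RemoteFacadeBean implements RemoteFacade {

    private final LocalProjectService projectService;

    RemoteFacadeBean(LocalProjectService projectService) {
        this.projectService = projectService;
    }

    @Override
    public long createProject(String title) {
        // No logic of its own: the facade simply delegates to the local service.
        return projectService.createProject(title);
    }
}
```

The facade holds no state and adds no behavior, which matches the description of the method implementations as mostly simple delegation.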

I am happy to send the caArray developers what essentially amounts to 2 Java files, one an interface, the other the implementation class, although I don't think they will benefit greatly from this.

I'm afraid the way I package this and make it available for developing client-side Java scripts/programs to populate caArray deviates from the way in which caArray is developed and encouraged in the caArray FAQ on MAT-KC. This is because I'm used to using maven2 rather than ivy for project and dependency management. I've created a maven project for our caArray extension and have a script for uploading the Java libraries that caArray requires to our local repository. I also have a maven project for client applications that upload, validate and import data. This might make it difficult to distribute our extension to other caArray users and of course there is the concern that our implementation is faulty in some way. I wonder if we should perhaps get the appropriate person from the caArray development team involved at this point.

I haven't tried our EJB with 2.3.0 yet. I did have to make changes for 2.2.0 however when the internal API for the ProjectManagementService changed and a method I was previously using disappeared. I had to track down the change made by the caArray developers and adapt accordingly. There is an inherent risk with going down this route as these internal APIs are always going to be more liable to change. I expect more of the same for 2.3.0.

7/30/2009

We are running two instances of caArray for test and production purposes, currently version 2.2.0. These are deployed in the normal recommended way, although I find the non-GUI command-line installer more practical on our server hardware. I am not using a different or specially modified JBoss instance.

To create the maven project, the first thing I do is upload the dependencies, namely caarray-common.jar, caarray-ejb.jar and caArray's dependencies, to my local maven repository and create a maven project that references these dependencies. I actually get these from the source distribution in lib/caarray-project and have scripted the process. I then use a maven command to create an eclipse project. I also use the caArray source code for debugging purposes, opening up a JDWP port on the JBoss server for remote socket debugging and connecting to this from within eclipse.

With regard to packaging our EJB and deploying within caArray, I take the original EAR file (caarray.ear), extract the EJB jar file (caarray-ejb.jar), add the additional classes using 'jar uf', add the updated EJB jar to the EAR file in the same way and then copy this back to the JBoss deploy directory. I have a 5 line script to do this.

The security aspects are handled in the same way as other caArray EJB beans that have remote interfaces, i.e. by configuring our EJB implementation class with the AuthorizationInterceptor, etc.

On the client side, I have a maven (and eclipse) project for accessing caArray via its remote Java API, not the grid API. I just extend this project to include a jar containing our additional EJB interface. In addition I have to include some caArray exception classes that aren't in caarray-client.jar - I do this by adding caarray-ejb.jar to the client side classpath (it's dirty, I know).

We don't use the Grid API at all and I am a little concerned about the possibility of the Java API disappearing altogether. The situation here is that we don't currently have an easy way to make our caArray server available to the outside world (problems with firewalls), so the grid service holds little immediate value for us; I actually have the Grid and UPT JBoss servers shut down most of the time.

I've attached the Java source (works for 2.2.0 and 2.2.1) and the small script I use to package the resulting classes into the caArray EAR file. Feedback/criticism from the development team welcome, although I suspect what will be more of interest are the methods I've exposed.

As you can tell, I'm hesitant about releasing this to other caArray users. For us this is an interim solution while programmatic APIs for uploading and importing data are developed by the caArray team - I understand this might come early next year. As such what I have is a short-term hack, a bunch of methods I need bundled together into a single EJB bean with little consideration of proper design, i.e. grouping of related methods, etc.

7/31/2009

On the client side, I connect in just the same way as using the Java API normally, i.e. using CaArrayServer. Well, actually I have my own version of CaArrayServer which has an additional lookup for our CaArrayServiceFacade EJB in the connectToServer method:

 caArrayServiceFacade = (CaArrayServiceFacade)
          initialContext.lookup(CaArrayServiceFacade.JNDI_NAME);

Somewhat annoyingly the CaArrayServer class is declared final and cannot be subclassed so I've had to take a copy and have to remember to keep this up to date with any changes in caArray releases. The following snippet shows how to connect and then access the service.

 CaArrayServer server = new CaArrayServer(hostname, port);
 server.connect(username, password);
 CaArrayServiceFacade service = server.getCaArrayServiceFacade();

For long operations, like validation and import of files, I poll for the expected change in status for the set of files involved, where the interval between poll attempts and the maximum number of attempts are configurable, and throw an exception if there is an unexpected status change. This allows for a script to be written that chains a series of such long operations together, waiting for one to complete before starting the next.
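The polling approach described above might look roughly like the following sketch. It is an illustration under assumptions rather than the actual client code: StatusPoller, awaitStatus and UnexpectedStatusException are hypothetical names, and the status source is an arbitrary Supplier standing in for whatever call reports the status of the files involved.

```java
import java.util.function.Supplier;

/** Hypothetical polling helper; none of these names come from caArray. */
final class StatusPoller {

    /** Signals a status that is neither the in-progress nor the expected one. */
    static final class UnexpectedStatusException extends RuntimeException {
        UnexpectedStatusException(String message) {
            super(message);
        }
    }

    private final long intervalMillis; // pause between poll attempts
    private final int maxAttempts;     // give up after this many polls

    StatusPoller(long intervalMillis, int maxAttempts) {
        if (intervalMillis < 0 || maxAttempts < 1) {
            throw new IllegalArgumentException("interval must be >= 0 and attempts >= 1");
        }
        this.intervalMillis = intervalMillis;
        this.maxAttempts = maxAttempts;
    }

    /**
     * Polls until the source reports the expected status.
     * Returns true on success and false if maxAttempts polls were used up.
     */
    <S> boolean awaitStatus(Supplier<S> statusSource, S inProgress, S expected) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            S current = statusSource.get();
            if (expected.equals(current)) {
                return true;
            }
            if (!inProgress.equals(current)) {
                throw new UnexpectedStatusException("Unexpected status: " + current);
            }
            if (attempt < maxAttempts) {
                pause();
            }
        }
        return false;
    }

    private void pause() {
        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while polling", e);
        }
    }
}
```

A script could chain long operations by calling awaitStatus once after requesting validation and again after requesting import, stopping at the first false return or exception.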

The other useful bit of client-side functionality we've written is to split really large MAGE-TAB archives into batches for upload and import, by splitting the SDRF file and duplicating the IDF for each batch but referencing one of the SDRF segments. This is to address the size limits on both uploading the zip archive and import. Perhaps this has been fixed though in the most recent release?
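The batching idea described above can be sketched as follows. This is a simplified illustration, not the actual client code: MageTabSplitter, splitSdrf and idfForBatch are hypothetical names, the files are represented as lists of lines, and the sketch assumes that the IDF references its SDRF in a tab-delimited row labelled "SDRF File". It splits purely by row count, so a real implementation would probably also need to keep related SDRF rows together in the same batch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for batching a MAGE-TAB submission; illustrative only. */
final class MageTabSplitter {

    private static final String SDRF_FILE_TAG = "SDRF File"; // assumed IDF row label

    private MageTabSplitter() {
    }

    /**
     * Splits SDRF lines (header row first) into batches of at most
     * rowsPerBatch data rows, repeating the header in every batch.
     */
    static List<List<String>> splitSdrf(List<String> sdrfLines, int rowsPerBatch) {
        if (sdrfLines.isEmpty()) {
            throw new IllegalArgumentException("SDRF must contain a header row");
        }
        if (rowsPerBatch < 1) {
            throw new IllegalArgumentException("rowsPerBatch must be >= 1");
        }
        String header = sdrfLines.get(0);
        List<List<String>> batches = new ArrayList<>();
        for (int start = 1; start < sdrfLines.size(); start += rowsPerBatch) {
            int end = Math.min(start + rowsPerBatch, sdrfLines.size());
            List<String> batch = new ArrayList<>();
            batch.add(header);
            batch.addAll(sdrfLines.subList(start, end));
            batches.add(batch);
        }
        return batches;
    }

    /** Copies the IDF lines, pointing the SDRF reference at one batch file. */
    static List<String> idfForBatch(List<String> idfLines, String sdrfFileName) {
        List<String> result = new ArrayList<>();
        for (String line : idfLines) {
            boolean isSdrfRow = line.equals(SDRF_FILE_TAG)
                    || line.startsWith(SDRF_FILE_TAG + "\t");
            result.add(isSdrfRow ? SDRF_FILE_TAG + "\t" + sdrfFileName : line);
        }
        return result;
    }
}
```

Each batch would then be zipped with its own IDF copy and the data files its rows reference, and uploaded and imported as a separate archive.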
