For the last few months researchers at Queen Mary, University of London, working with colleagues from Durham University and Brown University, have been using the GridPP infrastructure to investigate supersymmetric theories. This is a continuation of work that had already been moved from a single machine to a local batch system but needed the extra power the grid could offer. This piece is copyright GridPP; if you wish to reproduce it, please credit GridPP and contact Neasan ONeill to say you are using it.
Edinburgh Compute and Data Facility (ECDF) is a large local computing resource for hundreds of researchers at Edinburgh University, whose work spans the academic spectrum, from analysing brain scans to understand mental illness to exploring the dynamics of complex chemical systems. This diverse user base brings a broad range of requirements that need to work happily together, and GridPP is one of the more challenging users.
The Enabling Grids for E-SciencE (EGEE) project closed on 30 April 2010. The project brought together a computing infrastructure, software tools and services to support more than 10,000 scientific researchers across more than 170 research communities. During the two-year term GridPP played a key role in EGEE's success, being the biggest national contributor of computing resources.
The Real Time Monitor, developed by GridPP and Imperial College, has undergone another overhaul: it is now available in different versions, supports a greater number of platforms, is more stable and has a new website.
Hmm. We had a malicious user's DN on the Glasgow system this morning. I am sure that other UKI sites may be affected too. Be careful with your cleanup processes, as we missed something the first time round. Grr.
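One way to make the cleanup less error-prone is to sweep every grid-security config area for the DN rather than fixing files from memory. A minimal sketch, using a temporary directory standing in for /etc/grid-security and a made-up DN (not the real one from the incident):

```shell
# Sweep a config tree for every file still containing a banned DN,
# so the cleanup does not miss a stale mapping.
DN='/C=XX/O=Example/CN=Malicious User'      # illustrative DN only

# Demo setup: a temp dir standing in for /etc/grid-security
root=$(mktemp -d)
printf '"%s" .dteam\n' "$DN" > "$root/grid-mapfile"   # a stale mapping to catch

# List every file under the tree that still mentions the DN
grep -rFl "$DN" "$root"
```

On a real site the sweep would run over /etc/grid-security (and any middleware-specific mapfile locations) instead of the demo directory.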
After being all enthusiastic that the gSOAP errors had been nailed, we failed two SE tests in the last 24 hours. Exactly the same issue as before.
As this error message is so vague it looks like lcg-rollout is our only hope.
I note in passing that Glasgow has one of the most reliable SEs in the UK for ATLAS (2.1% job loss, only beaten by Oxford who have 0.8%; UK average in Q3 was 8% loss) so this is particularly galling.
Shouldn't we be making the results seen by our real customers rather more important than a once-an-hour stab in the dark from ops?
Even after the successful upgrade of DPM we started to get plagued again by SAM test failures with the same generic failure message: "CGSI-gSOAP: Error reading token data header: Connection closed".
This time they came principally from the SE test, instead of from the CE-rm test.
For a while I wondered if there was a DNS problem, but this seemed unlikely for two reasons:
Acting on this suspicion, I changed the CE configuration to download CRLs every hour, and the clients to download them from the CE every four hours.
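Mechanically this is just cron scheduling. A sketch of the /etc/cron.d entries, assuming the standard fetch-crl tool on the CE and a hypothetical rsync of its CRL directory to the clients (hostnames, paths, flags and times are illustrative and vary by fetch-crl version and site):

```shell
# /etc/cron.d/crl-refresh (sketch)
# CE: pull fresh CRLs from the CAs every hour (odd minute to spread load)
12 * * * *   root  /usr/sbin/fetch-crl
# Clients: copy the CE's CRL directory every 4 hours
# (ce.example.ac.uk and the 'crls' rsync module are hypothetical)
7 */4 * * *  root  rsync -a ce.example.ac.uk::crls/ /etc/grid-security/certificates/
```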
I made this change on Friday and, so far, we haven't seen the error again.
My eternal complaint with X509/openssl is why the error is reported as "CGSI-gSOAP: Error reading token data header: Connection closed" and not "CGSI-gSOAP: Error reading token data header: Connection closed [CRL for DN BLAH out of date]".
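In the meantime the stale-CRL condition can at least be checked by hand. A sketch, assuming the usual hashed layout under /etc/grid-security/certificates, where each CA's CRL sits in a <hash>.r0 file:

```shell
# Print the validity window of every installed CRL.
# A nextUpdate in the past is exactly the expired-CRL condition that
# surfaces only as the opaque gSOAP "Connection closed" failure.
for crl in /etc/grid-security/certificates/*.r0; do
    echo "== $crl"
    openssl crl -in "$crl" -noout -lastupdate -nextupdate
done
```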
Is that so very hard to do?
Finally we are green for the latest Atlas releases...
We've made a lot of progress this past week with ECDF. It all started on Friday 10th Oct when we were trying to solve some Atlas installation problems in a somewhat ad hoc fashion.
We then incorrectly tagged and published the site as having a valid production release. This caused serious problems for the Atlas jobs, which resulted in us being taken out of UK production and missing out on a lot of CPU demand. This past week we've been working hard to solve the problem, and here are a few things we found:
1) First of all, a few of us had access problems to the servers, so it was hard to see what was actually going on with the mounted Atlas software area. Some of this has now been resolved.
2) The installer was taking ages and then timing out (the proxy expired, and SGE eventually killed it off). strace on the nodes linked this to very slow performance when doing many chmod writes to the file system. We solved this with a two-fold approach:
- Alessandro modified the installer script to be more selective about which files need chmod-ing, but the system was still very slow.
- The NFS export was then changed to allow asynchronous writes, which considerably sped up the tiny writes to the underlying LUN. There is now a worry about possible data corruption if the server goes down, which should be borne in mind if we see Edinburgh-specific segfaults or problems with a release. Orlando may want to post more information about the NFS changes later.
3) The remover and installer used ~3 GB and 4.5 GB of vmem respectively, and the 6 GB vmem limit had only been applied to prodatlas jobs. The 3 GB vmem default started causing serious problems for sgmatlas. This has now been changed to 6 GB.
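The NFS change in (2) boils down to a single export option. A sketch of the relevant /etc/exports line, with illustrative host and path names:

```shell
# /etc/exports on the software server (names are illustrative)
# 'async' lets the server acknowledge writes before they reach the LUN:
# much faster for many tiny chmod/metadata operations, but any writes
# still buffered when the server crashes are lost.
/exports/atlas-sw  wn*.ecdf.ed.ac.uk(rw,async,no_root_squash)
```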
We're also planning to add the "qsub -m a -M" SGE options on the CE, to let the middleware team better monitor the occurrence of vmem aborts. We might also add a flag to help parse the SGE accounting logs for APEL. Note: the APEL monitoring problem has been fixed, but that's for another post (Sam?)...
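Those are standard SGE submission flags. A sketch of how a monitored install job might be submitted, with a hypothetical notification address and the vmem request mirroring the 6 GB limit from (3):

```shell
# -m a : send mail when the job is aborted (e.g. killed for exceeding vmem)
# -M   : where to send it (address is illustrative)
# -l h_vmem=6G : request the 6 GB virtual-memory limit
qsub -m a -M grid-support@example.ac.uk -l h_vmem=6G install_atlas_release.sh
```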
Well done to Orlando, Alessandro, Graeme and Sam for helping us get to the bottom of this!
Well, I was waiting for Mike and Andrew to blog this, but they haven't. They very successfully upgraded Glasgow's DPM to the native 64-bit version on Monday last week (when we upgraded to SL4, only the 32-bit version was available). This was a significant step forward, but it required the head node and all of the disk servers to have their OS rebuilt without losing data, and the database restored onto the head node.
It went very well and we were up and running again within 6 hours - no data lost!
We are also seeing an improvement in the SAM test results, with the spurious 'gSOAP' errors which were plaguing us now seemingly having gone (fingers crossed!).
It's terrible that the LHC is not running right now, but it does mean that interventions like this can be done.
Great work guys!