***********************************************
* RESOURCE ALLOCATION AND RESOURCE USAGE
* Thursday 20 January 2010 - 10.00 - 17.00
* CC-IN2P3, Villeurbanne
***********************************************

Document: Meeting minutes and notes
Author: Gilles Mathieu

Meeting Agenda:
---------------
http://indico.in2p3.fr/conferenceTimeTable.py?confId=4861

Participants:
------------
HC - Hélène Cordier
RR - Rolf Rumler
CB - Cécile Barbier
GM - Gilles Mathieu
TS - Tomasz Szepieniec
CL - Cal Loomis
VB - Vincent Breton
GF - Géraldine Fettahi
TG - Tristan Glatard
CO - Cyril L'Orphelin, whose initials are actually CL but we've already got one, so...
JCC - Jean-Claude Chevaleyre (connected through video/phone)
EM - Emmanuel Medernach (connected through video/phone)
MA - Mirvat Aljogami (connected through video/phone)
JN - Julien Nauroy (connected through video/phone)

_________________________________________________________________________________________
0 - Introduction - objectives of the day
-----------------------------------------------------------------------------------------
Presentation by HC
material: http://indico.in2p3.fr/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=4861

HC explains why we are here today and introduces the contributions we will have from VB and TS.
Presents the open questions to be addressed:
- What is the right legal term for agreements: SLA/MoU/else?
- What are the entities that need to be involved in the definition of agreements?
- What do SLAs apply to? What are the “fields” and units?
- France-Grilles context specificities / international context
- What kind of Service Level Management do we plan on?
- Expected return on investment in terms of quality of service and traceability
HC presents the objectives of the day.
Introduces a "round the table" to identify any other expectations.

_________________________________________________________________________________________
1 - round table
-----------------------------------------------------------------------------------------
TG - deputy manager of biomed. Interested in resource allocations from a user point of view.
GF - works in EGI on NA3, user support
VB - France-Grilles Director, LPC
CL - from LAL, here to look at how resource allocation works in other contexts
TS - involved in resource allocation in PL-Grid, here to share ideas
GM - joined France-Grilles recently, will be working with HC
CB - from LAPP, from the accounting group and user support
RR - still involved in the NGI topic, here to share expertise

RR: there are several types of resource allocations (RA): low level ones, e.g. sites to VOs, and higher level ones (requests for services, involving different partners...). I would like to have a clearer view of this. Need to understand the high level process and the upstream part. Need to assess the difference between what has been done in EGEE by EGAAP and what we can do now.
VB: there is a technical aspect to RA. What is the policy for attributing resources? There is a lot to discuss about that. Ongoing discussion within the France-Grilles (FG) management board, especially with funding bodies. There is a need for accountability.

_________________________________________________________________________________________
2 - Resource Allocation: What can we learn from HPC?
-----------------------------------------------------------------------------------------
Presentation by VB
material: http://indico.in2p3.fr/getFile.py/access?contribId=2&resId=0&materialId=paper&confId=4861
          http://indico.in2p3.fr/getFile.py/access?contribId=2&resId=0&materialId=slides&confId=4861

The paper uploaded to the agenda shows the upstream part we just talked about.
RA is a request from the French Ministry.
We have to demonstrate that we are able to give resources, priorities... We need to show that we have a policy and that we control the situation.

VB presents the current situation, which is bad. There is no clear policy, and the model is still under construction.
The distribution of resources is not as it should be (example given: 90% of EGI resources are used by HEP), which is detrimental to many communities.
Challenges linked to the grid context:
- control is difficult
- funding comes from multiple sources

VB presents how RA works in the HPC world. Explains the GENCI structure and architecture.
At the international/national level, RA is based on calls ("appels à projets"), through the portal EDARI (www.edari.fr), "Demande d'Attribution de Ressources Informatiques".
600 projects are submitted every year. Almost all get supported.
An evaluation committee is in charge of evaluating needs and distributing resources.
There is a monitoring process to check whether requested resources are used.
At the regional level, GENCI involves 10 partners (centres, universities...) in France.
Each centre is responsible for attributing resources and reporting back on usage.
In HPC, RA is based on scientific evaluation, with a priori and a posteriori control.

In the grid world, scientific evaluation should be done at VO level. France-Grilles does not have the expertise to do that.
Each VO should have a scientific leader who participates in resource allocation.
Depending on the amount of resources requested, the "a priori" evaluation can be done on a per project basis, and then globally at NGI level.
Above a certain threshold, this could be done in 2 steps, involving scientific partners of the NGI.
For the "a posteriori" evaluation, we would need a scientific council and an annual evaluation of resource usage.
Scientific coordinators from the VOs would play an essential role.
Suggestion to create a "France-Grilles Award" for the best scientific application.
VB presents an action plan resulting from a dialogue started with GENCI about what could be pooled.
EGI is aware of the problem of RA but is busy with other things.
Need to work closely with the leader of the EGI.eu NA3 working group.

_________________________________________________________________________________________
3 - Discussion and questions
-----------------------------------------------------------------------------------------
RR: this shows some of the processes in the HPC world. HPC is for people who already have well established projects and some recognition. Grid is different: it is for people who have an idea and need "some" resources without necessarily being able to demonstrate "a priori" the validity of their requirements. We should keep that somewhere. It would be worth the effort to explain to the ministry the difference between grid and HPC.
VB: already spent a lot of time trying to do that.
RR: it might be worth demonstrating that resources are not wasted if this helps to demonstrate the validity of the project. A grid should remain a seed bed for new ideas.
VB: we need to keep this on-demand access. Anyone can try the grid, but above a certain level there should be some accountability.
HC summarises: RR is talking about the low level, below the threshold.
TG: 2 solutions to it. This should be controlled at VO level, not individual level. E.g. biomed has a certain amount of resources, and can decide among its members who can "try" the grid for a certain application/idea. It should be up to the VO to control that.
GM: for new disciplines, this can't be done at VO level because there is no VO yet. Shouldn't the control be done at discipline level?
RR: at the beginning of EGEE this was done on the principle of scientific domains. If control of resources is put at the scientific discipline level, then you allow for this kind of experimentation by new users/communities.
RR: There was a mechanism in place in the IDG when it started, to gather the needs of the scientific community. The people who built it are still here; it might be worth reactivating the process.
TG: We have been told that all VOs should be part of the VRC, but this does not have a physical existence.
RR: we are here to discuss the national scope of RA, not the international one.
VB: ESRIs (structures to coordinate science at an international level) are still not ready. All communities are fighting to get as much as possible for their own project, and not really trying to collaborate.
CL: in order for the "a priori" study to make sense, the infrastructure has to be oversubscribed. If not, this is insane. For the "a posteriori" study we need to assess how much of our resources are delivered to French communities, and how much to international ones. For French political bodies, we need to justify that resources delivered to non-French communities are not wasted. This is a matter of demonstrating that we get this back somehow.
VB: Agree. There is a "return on investment" perspective.
RR: yes, but what is "return on investment"? E.g. giving resources and gaining scientific publications: how do we measure that?
HC: globally there are three points we can extract:
- the composition of the scientific committees that decide on RA
- how we will have to consider the international level
- what the return on investment is
This joins our open questions for this afternoon's discussion.

_________________________________________________________________________________________
4 - PL-GRID Operations model
-----------------------------------------------------------------------------------------
Presentation by TS
material: http://indico.in2p3.fr/getFile.py/access?contribId=4&resId=0&materialId=slides&confId=4861

TS has been involved in this process for a few years now.
Realisation that best effort is not enough:
- user expectations are different depending on the community
- provided resources are heterogeneous
Idea: the PL-Grid project acts as a unique point of contact for users with many providers.
In Poland, RA policy is decided at the Resource Provider (RP) level (final decision). The sites are responsible for how the resources are used.
At international level, participation in the gSLM.eu project to increase understanding of resource allocation.
Add a user dimension to RA: provide a service to a specific user/VO.
Importance of the "non trivial quality of service" in the definition of the grid by I. Foster.
From an ITIL point of view, a grid is about delivering "value" to customers.
-> Value = Utility*Resources / Warranty*QoS
Overview of ITIL and the gSLM project (service level management for grids) - www.gslm.eu
Definition of the main grid service: providing resources to users with the required QoS.

RA process in PL-Grid:
1- User asks for resources by proposing an SLA.
   A threshold is set on the amount of usage required, above which the decision of a committee is needed.
2- Verification coming from the scientific community, under the NGI umbrella.
   The SLA becomes "open" for RPs to agree on it.
3- RPs provide subSLAs where they propose what they can provide.
4- Discussion process - User needs to agree on each subSLA.
5- The NGI can agree a subSLA with an RP on behalf of the user.

Relations where agreements are needed:
- User-RP
- User-NGI
- RP-NGI
- NGI-EGI
But the NGI can play the role of agent for the User or the RP.

ITIL definitions and terminology:
- SLAs (service level agreements) are between organisations for delivering services
- OLAs (operations level agreements) are between teams/bodies in the same organisation
- Other contracts: delivering subservices to support SLAs.
From an ITIL point of view, between User and NGI the agreement is an SLA.
This is not necessarily legally binding.

(The end of TS's talk gets down to technical details.)
The PL-Grid model takes the EGEE/EGI model as a starting point.
If the initial goal is to provide services to users, this needs to be reflected in the model.
Service Level Management is at the heart of the PL-Grid model.
Its technical implementation is done through a tool called Bazaar. The tool re-uses the principles of schedulers to deal with SLAs.

_________________________________________________________________________________________
5 - Discussion and questions
-----------------------------------------------------------------------------------------
TG: any experience of international VOs using this system?
TS: we have an option for providing resources to an international VO, but this is based on the assumption that there are Polish users in it. In that case, the Polish group of the international VO plays the role of the user in the model.
CL: what resources do you actually monitor?
TS: mainly CPU, and we are preparing to monitor storage.
VB: for me an SLA is legally binding. A grid service provider cannot provide a pledge without a pledge from the network provider. What about network in this picture?
TS: In our case SLAs are just an agreement/reference. They are not legally binding because that would be difficult to do. Concerning network, we have general agreements with network providers (part of GEANT) and for now this is enough. However there are some cases where a specific agreement would be needed, and it is not in place.
GM: the model is described as being RA-centric. How does that translate into daily operations for NGI operators? Is everything centered around the Bazaar tool?
TS: in terms of tools there is the PL-Grid portal with user registration, descriptions of SLAs, etc. In Bazaar there is the resource allocation part only. The rest of operations is done "normally". Our goal is to provide a single interface to users, not to operators.
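
Illustration of the PL-Grid RA process described by TS in section 4 (SLA proposal, threshold, verification, subSLAs, agreement by the user or by the NGI on the user's behalf). This is only a minimal sketch and does not reflect how Bazaar is actually implemented: all class, function and site names, the use of CPU hours as the unit, and the numbers are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PROPOSED = "proposed"   # step 1: SLA proposed by the user
    OPEN = "open"           # step 2: verified, RPs may now offer subSLAs
    REJECTED = "rejected"


@dataclass
class SubSLA:
    provider: str           # resource provider (RP) making the offer
    cpu_hours: int
    agreed: bool = False
    agreed_by: str = ""     # "user", or "NGI" when the NGI acts as agent


@dataclass
class SLA:
    user: str
    cpu_hours: int          # requested amount (unit is an assumption)
    state: State = State.PROPOSED
    sub_slas: list = field(default_factory=list)

    def needs_committee(self, threshold: int) -> bool:
        # Step 1: above the threshold, a committee decision is needed.
        return self.cpu_hours > threshold

    def verify(self, approved: bool) -> None:
        # Step 2: verification under the NGI umbrella opens the SLA to RPs.
        self.state = State.OPEN if approved else State.REJECTED

    def offer(self, provider: str, cpu_hours: int) -> SubSLA:
        # Step 3: an RP proposes what it can provide.
        if self.state is not State.OPEN:
            raise ValueError("SLA is not open for offers")
        sub = SubSLA(provider, cpu_hours)
        self.sub_slas.append(sub)
        return sub

    def agreed_cpu_hours(self) -> int:
        return sum(s.cpu_hours for s in self.sub_slas if s.agreed)

    def is_covered(self) -> bool:
        # True once the agreed subSLAs add up to the requested amount.
        return self.agreed_cpu_hours() >= self.cpu_hours


def agree(sub: SubSLA, by: str = "user") -> None:
    # Steps 4 and 5: each subSLA is agreed by the user, or by the NGI
    # on behalf of the user.
    sub.agreed = True
    sub.agreed_by = by


# Hypothetical request, threshold and sites.
sla = SLA(user="example-user", cpu_hours=50_000)
print("committee needed:", sla.needs_committee(threshold=10_000))
sla.verify(approved=True)
a = sla.offer("site-A", 30_000)
b = sla.offer("site-B", 20_000)
agree(a)              # agreed directly by the user
agree(b, by="NGI")    # agreed by the NGI acting as agent
print("request covered:", sla.is_covered())
```

Keeping one record per subSLA, rather than a single total, follows the point made in the talk that the user (or the NGI as agent) has to agree on each subSLA separately.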
_________________________________________________________________________________________
6 - morning wrap-up - structure of the afternoon's discussions
-----------------------------------------------------------------------------------------
Identified needs and goals:
- current situation is not good enough
- Demonstrate that we have a policy to allocate resources and that we control the situation.
- Need for accountability.
- rebalance the distribution of resources between different communities
- Efficiently deliver a "non trivial quality of service"

Key concepts:
- A priori and a posteriori analysis for RA
- need for scientific coordination for taking decisions - bodies to identify
- Introduction of a "user" dimension in the structure of operations

Key Challenges:
- Challenges linked to the grid context:
  - control is difficult
  - funding comes from multiple sources
- need to keep the grid as a tool allowing new communities to join in and use resources
- link local, national and international aspects

Steps already done, to investigate and continue:
- initiated collaboration with GENCI
- initiated collaboration with PL-Grid

Questions that emerged:
- What is the composition of the scientific committees that decide on RA?
- How will we have to consider the international level?
- What is the return on investment?

Reminder: Questions we need to address by the end of the day:
- What is the right legal term for agreements: SLA/MoU/else?
- What are the entities that need to be involved in the definition of agreements?
- What do SLAs apply to? What are the “fields” and units?
- France-Grilles context specificities / international context
- What kind of Service Level Management do we plan on?
- Expected return on investment in terms of quality of service and traceability

_________________________________________________________________________________________
7 - Discussion - Answering open questions part.1
-----------------------------------------------------------------------------------------
HC: are networks an item for SLAs or not? Should we consider this aspect? Does the question make sense outside of the scope of LCG?
CL: 2 consequences:
- Do you want to include network in the SLA?
- Do you want to account for it?
That comes down to the inclusion of network in the a priori or a posteriori discussion.
TS: networking could be considered as another type of resource.
TS: How are the user communities positioned in this model? It seems that in France the user communities want to be involved in the resource allocation decision. In PL-Grid the final decisions are taken by the RPs.
RR: in Vincent's presentation, the question was who decides whether an SLA is allowed or not. Tomasz's view is about who decides once the SLA is allowed.
GM: There are 2 dimensions: the global pool of resources allocated to a VO, and how the VO shares its resources between its user groups. The 1st dimension corresponds to what already exists in PL-Grid, while the 2nd adds this user dimension. These are 2 different types of agreements when it comes to who takes the decision.
RR: technically it is possible to distribute CPUs according to groups within the same VO, although this has to be done by RPs.

Where does a user go to ask for resources?
1./ to the VO itself
2./ to a site directly
3./ Grid at regional level
4./ NGI

TS: How do VOs/VRCs want to collect requirements for resources? If we direct users to VOs, then VOs need to be able to collect these requirements.
CL: question to Vincent: how does that work in HPC?
VB: resources are allocated to projects, so the intra-project distribution is out of scope.
Building a parallel with grids: in this case users go to the VO asking for resources, and the VO will negotiate with RPs.
RR: in EGEE there has been a case of an individual asking for resources at the project office itself. It created a security incident, as it was for RSA cracking. So the situation can arise where an individual makes an application outside of the scope of a VRC.
TS: draws a diagram to clarify the 2 levels where agreements are needed:
- within the VOs, to distribute resources between groups
- outside the VO, to ask RPs for resources.
CL: the general question is: who controls the resources?
VB: In HPC this is clear: GENCI buys the machines, so they control them. In the grid, the funding scheme is very heterogeneous, so it is difficult to assess who controls what. An RP can have resources funded by very different bodies.
CL: if we don't define that, then doing Resource Allocation doesn't make sense because we can't commit to anything. The mix of finances is different at each site, so it is up to the site to decide which resources to dedicate to a national grid.
TG: the NGI doesn't have to control the resources, it can just act as a relay.
EM: what about having subgroups of a national VO, with this VO managed by the NGI itself?
RR: this would work for most of the yet-unknown projects. However I don't see how the 4 LHC VOs would fit in that. They have the resources, they have the financing for this, and they just don't care. Their relationship to the NGIs is not really strong.
VB: it is probably going to change because the budget of the LCG is getting low. This might bring them towards collaboration.
TG: this would mean all users are members of this VO, which is not true.
VB: Emmanuel's idea would not work for international VOs but would be fine for national communities.
HC: summarising: that is the use case of the NGI administering some of the resources.
CL: to some extent this is an accounting thing.
RR: considering resource distribution here on site (at CC-IN2P3), there is a clear separation between LCG and anything else. The "anything else" gets split between HEP and the rest. The sum of the "rest" cannot exceed a certain percentage. Here LCG is currently around 80%. The perspective of the site is different from that of the NGI. Resource Allocation cannot work the same way.
TS: there is a very complicated model here, with a lot of options... Are we going to simplify this? It is probably not possible. So the other option is to figure out a model that integrates this complexity and allows bodies to decide at all levels.
RR: if we look at LCG, their model - which works - is to negotiate resources with sites directly (e.g. computing resources). They go to NGIs for services (e.g. grid operators). So depending on the "resources" to which the agreement applies, the bodies involved might be different.
TS: the conclusion of this discussion in gSLM is that there are different types of services, with or without capacity.
RR: this would mean that a VO can go to an NGI to negotiate all of its resources, or only part of them (e.g. monitoring, accounting etc.)
TS: this is good because it clearly shows what the responsibilities of the NGI are, namely services, while sites are responsible for resources.
HC: so, an SLA on services would go to the NGI, and an SLA on resources would go to the sites?
TS: mainly, although there are cases where an SLA on resources could go to the NGI, e.g. for brokering.
GM: we have to be careful about the complexity of the model. If we want to integrate too many things from the start, it might take long to reach something concrete. It might be better to start with something, even incomplete, because in any case it would be better than the current situation, which is that we have nothing.
TS: we are not starting from nothing: there are practices in place, funding etc. So we need to evolve from this towards something that can integrate other aspects and evolve.
RR: a different approach would be to hide the complexity from the user. The interface to the user should be very simple, and on the other side there would be the complex part that deals with those complex things.
-> Importance of the single point of contact
VB: users attend a training, and at the end you can point them to existing VOs where they can deploy their project. And if they can't, they can be directed to this national VO.
RR: users should be directed, not have to find out by themselves how this works. They also should not be drowned in administrative procedures.
VB: indeed it should be easy, and currently it is not. We should be concentrating on building this national VO to make it easy for new users to get to the grid.
CL: a national VO is a good idea for new users, especially because there won't be big needs for resources. But when it comes to large allocations of resources, this is not necessarily a good way. So there are 2 different levels of things, and trying to address different issues through the same solution is not necessarily good.

_________________________________________________________________________________________
8 - Discussion - Answering open questions part.2
-----------------------------------------------------------------------------------------
HC: What about the international context?
CL/VB: Action for the future: put pressure on EGI to provide policies at project level, in order to improve coordination between NGIs.
TS: Such an activity was identified by EGI-DS, but the decision was taken to cancel it.
HC: what are the impacts of the implementation of the strategy on France-Grilles operations?
VB: The strategy needs to be presented to the scientific advisory committee in March. We need to talk to sites to check what they think about this. Need to iterate on that.
GM: so what is next concretely? Are we going away and drafting a strategy, or do we try and figure out how we structure this drafting, getting e.g.
feedback from the user community, sites, etc.?
HC: the process of asking for feedback can be complementary to drafting a first strategy. That comes back to the need to iterate on the question.
CL: two aspects to draft, 2 different levels, 2 answers:
- National VO - resource allocation for users
- scientific committee to evaluate projects and provide guidance
HC: what about timelines?
VB: something needs to be presented in March at the scientific advisory committee.
GM: yes, but how far do we want to go before that? Should the strategy be fully defined by then?
CL: the draft document should be a high level definition of where we want to go, gathering feedback on ideas and feasibility.
VB: the key point is to get feedback from sites about whether they would agree to give a part of their resources to a national VO.
CL: would 2 weeks be too short to agree on a draft ready to be circulated?

_________________________________________________________________________________________
9 - Wrap-up, conclusion and next steps
-----------------------------------------------------------------------------------------
GM goes through the initial set of questions, which have all been addressed somehow.
CL: Another important point is that you need to have accounting on everything you have an SLA on. If you can't measure things, it is useless to make agreements on them.
CL: this topic seems to be of interest to other NGIs; maybe this could result in a slot at the User Forum?

Round the table for any additions to this.

HC thanks the participants, especially Tomasz for coming from so far.

Identified list of actions:
---------------------------
- GM: circulate meeting minutes, as well as a concise summary of the key points addressed
  Timeline: a.s.a.p.
- HC: drafting the strategy, based on the outcome of this meeting
  Timeline: 2 weeks for drafting the strategy
  objective: 8/02, then gather feedback, iterations
  Final deadline: strategy proposal ready by mid March.
- VB: gather feedback about the concept of a national VO from management bodies
- VB: follow up on how we could integrate a discussion on resource allocation at the UF.