**Speakers:** Luis Villegas and Sean Shypula, Bungie
Distributed client/server system
- Splits work up and runs it in parallel
- Processes user-submitted tasks in parallel
- 180 rack-mounted machines, 300 processes
- can use
Advantages
- Speeds up time-consuming tasks (rendering goes from a day to a few hours)
- Results of work are seen more frequently, which means more iteration, which enables adding more polish
- Automates complex processes and reduces human error
- Click a button and get an email when the job is complete
Main processes on the farm
- 3 main processes
  - Binary builds: game executables and tools
  - Lightmap rendering: all of a level's static lighting is precomputed and baked into the level (map) files
  - Content builds: raw assets are built into the monolithic level files that ship on disc
- Other tasks: shader compilation, cubemap rendering, production builds of the bungie.net website, jobs that patch machines (OS, administrative tasks)
Bungie Farm
- 3rd iteration
  - Halo 1: asset processing done by hand, little automation
  - Halo 2: automated and distributed complex tasks; the binary build and lightmap systems were automated, but as separate systems
  - Halo 3: unified all the systems into a single extensible system
Achieved During Halo 3
- Unified codebases: implemented a single system that is flexible and generic
- Unified server pools: one farm for all
- Updated the technology to .NET (rewrote in C#); the goal was to make it as easy as possible to develop and maintain
What our system has done
- 50,000 jobs
  - 11K binary builds
  - 9K lightmap jobs
  - 28K jobs of other types
- Huge time saver; reduces artist and developer time
End User experience
- Make it as easy to use as possible: press a button and magic happens
- Users get the result back
Interfaces (Build)
- Web-based tools, RSS enabled
- Builds: see builds running on the system and kick off a new build
- Status: shows the status of each build configuration, shows red if one fails, and shows the log for each build
- Changes: lists the files that changed
- Per-machine status: idle or not
Random message on Bungie slides: non facete nobis calcitrare vestrum
Designer - Kicking off lightmap jobs from their tools
- Lightmap Monitor UI: view the status of all maps in the game, whether they are up to date, and which sections still need to be done
Architecture
- Single system with multiple workflows
- Plug-in based
- Workflows divided into client and server parts
- Single centralized server, multiple clients
- Not peer-to-peer; clients communicate only with the server
- Server manages each job's state, including serializing/persisting that state
- Communication is done using SQL Server
Information Flow
- Web server > SQL Server > controller server > farm
Binary Build site
- Automates code compilation and the automated test process
- Creates a snapshot of the source tree and symbols for each build
- Default is incremental builds (diffs)
- Continuous integration and scheduled builds
- Devs run on-demand builds; scheduled builds run at night
- Builds take 15 minutes on the farm
Debugging improvement
- Debugging used to be a manual process (finding and copying files before attaching to a box)
- Goal: get rid of the manual steps
- Symbol Server (Debugging Tools for Windows)
  - Symbols are registered on a server by the build site once all configurations finish
- Source Stamping (Visual Studio)
  - Linker setting specifies the official location of that build's source code (/SOURCEMAP)
  - When stepping through code, VS automatically grabs the source and pulls it down
- An engineer can attach to any box from any machine with VS installed
- Correct source and symbols are downloaded automatically
Lightmap Farm
- [shows beautiful before/after shots]
- Most time-consuming farm process
- Lightmapper was written specifically to run on the farm
- Specify a chunk of work per machine (distribute the work)
- Merge the results
- Simple load-balancing scheme
- Each job can be configured
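The chunk-and-merge pattern in the list above can be sketched in a few lines of Python. This is only an illustration under assumptions: the real lightmapper was Bungie's own tool and the talk did not show its interface, so `split_into_chunks`, `render_chunk`, `merge_results` and `run_job` are hypothetical names, a thread pool stands in for farm machines, and round-robin dealing stands in for the "simple load-balancing scheme".

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_workers):
    """Deal items round-robin into at most n_workers non-empty chunks."""
    chunks = [items[i::n_workers] for i in range(n_workers)]
    return [chunk for chunk in chunks if chunk]


def render_chunk(chunk):
    # Stand-in for the real lightmapper (hypothetical): "render" each section.
    return {section: f"lightmap({section})" for section in chunk}


def merge_results(partials):
    """Combine the per-worker partial results into one result."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


def run_job(sections, n_workers=4):
    # Each worker thread plays the role of one farm machine.
    chunks = split_into_chunks(sections, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(render_chunk, chunks))
    return merge_results(partials)


if __name__ == "__main__":
    result = run_job([f"section_{i}" for i in range(10)], n_workers=3)
    print(sorted(result))
```

Round-robin dealing keeps chunk sizes within one item of each other, which is about the simplest balancing that works when sections cost roughly the same to render.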
Cubemap Farm
- Used for in-game reflections
- Has to run on Xbox dev kits, so the farm was expanded to include Xbox dev kits
All slides are available on bungie.net
Implementation Details
- C# and .NET: very pleased with the decision
- Will stick with C# for tools development for the foreseeable future
.NET XML Serialization
- Originally chose an XML serialization scheme and ran into issues
- .NET dynamically creates a DLL for each serialized type and loads it into the appdomain; some antivirus software could lock it during serialization calls
- Moved to binary serialization: faster, used less memory, consumed less DB space
Memory Management
- GC: server memory could grow out of control or even cause crashes; collection would only happen under really high memory pressure, and by that point slowdowns had already occurred
- Workaround: explicit GC; be smart about it and do it right after a task is complete
- Bottom line: still need to keep memory usage in mind
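The workaround above, collecting at a known quiet point right after a task finishes, looks roughly like this. The farm was written in C#, where the corresponding call is `GC.Collect()`. The sketch below is a Python analogue only (Python's collector behaves differently from .NET's), and `run_tasks` is a hypothetical name.

```python
import gc


def run_tasks(tasks):
    """Run tasks one after another, forcing a collection between them."""
    results = []
    for task in tasks:
        results.append(task())
        # Explicit collection right after a task completes, rather than
        # waiting for the runtime to react to memory pressure mid-task.
        gc.collect()
    return results


if __name__ == "__main__":
    print(run_tasks([lambda: "build done", lambda: "lightmap done"]))
```

The point is the placement of the call: between tasks, a pause is harmless and the finished task's working data is no longer needed.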
Plug-ins
- Each workflow is implemented as client/server plug-ins
- Each plug-in is a DLL
- Failures are isolated to a single DLL: if a job/plug-in crashes, all other jobs are unaffected
- Only a single active job is kept in memory at a time
- Inactive jobs are serialized into the DB
- If there is a crash, remove the job and move on to the next one
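A minimal sketch of the job loop described above, assuming a SQLite table in place of the farm's SQL Server database and plain Python callables in place of plug-in DLLs. All names (`make_store`, `submit`, `run_all`, the `jobs` table) are hypothetical, and a `try`/`except` is a much weaker form of isolation than separate DLLs. It shows only the shape of the idea: one job deserialized at a time, and a crashing job dropped so the rest keep running.

```python
import pickle
import sqlite3


def make_store():
    """Create an in-memory job store; inactive jobs live here, serialized."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, plugin TEXT, state BLOB)"
    )
    return db


def submit(db, plugin, state):
    db.execute(
        "INSERT INTO jobs (plugin, state) VALUES (?, ?)",
        (plugin, pickle.dumps(state)),
    )
    db.commit()


def run_all(db, plugins):
    """Process jobs in order; returns (completed_ids, dropped_ids)."""
    completed, dropped = [], []
    while True:
        row = db.execute(
            "SELECT id, plugin, state FROM jobs ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break
        job_id, plugin, blob = row
        try:
            state = pickle.loads(blob)  # only this one job is in memory
            plugins[plugin](state)
            completed.append(job_id)
        except Exception:
            # A failing plug-in only costs its own job.
            dropped.append(job_id)
        db.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
        db.commit()
    return completed, dropped
```

A job whose plug-in name is not registered is treated the same way as one that crashes: it is removed and the loop moves on.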
SQL Messaging
- Senders post to a table; the recipient polls the table
- Benefits
  - Transactional, fault tolerant
- Drawbacks
  - Difficult to scale to multiple clients
  - SQL DB maintenance (if the DB went down, the whole farm stopped)
  - Messages aren't received immediately
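The post-and-poll scheme above can be sketched with a single table. This assumes SQLite instead of SQL Server, and the table layout and function names (`open_queue`, `post`, `poll_once`, `poll_until`) are invented for illustration; the talk did not show the real schema.

```python
import sqlite3
import time


def open_queue():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, recipient TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return db


def post(db, recipient, body):
    """Sender side: insert one message inside a transaction."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO messages (recipient, body) VALUES (?, ?)",
            (recipient, body),
        )


def poll_once(db, recipient):
    """Recipient side: fetch and remove all pending messages, oldest first."""
    with db:
        rows = db.execute(
            "SELECT id, body FROM messages WHERE recipient = ? ORDER BY id",
            (recipient,),
        ).fetchall()
        db.executemany(
            "DELETE FROM messages WHERE id = ?", [(row[0],) for row in rows]
        )
    return [row[1] for row in rows]


def poll_until(db, recipient, interval=0.05, attempts=20):
    """Poll repeatedly; messages arrive with up to `interval` seconds of lag."""
    for _ in range(attempts):
        bodies = poll_once(db, recipient)
        if bodies:
            return bodies
        time.sleep(interval)
    return []
```

The polling interval is where the "messages aren't received immediately" drawback comes from: a shorter interval lowers latency but puts more load on the database that every client shares.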
Future Development
- Dynamic allocation of machines for certain tasks (a priority build or lightmap job that needs to be rushed through)
- Ability to restart a job from a specific point
- Improve admin tools
- Create a test farm
- Extend the system to idle PCs
- WCF (Windows Communication Foundation) for communication; could replace the SQL messaging system we have
- WF (Windows Workflow Foundation): the farm is essentially a collection of workflows
Implementing a Distributed Farm
- Don't need a very large farm to get the benefits of automation/distribution
- Farm middleware packages: if starting from scratch, would consider them (they didn't exist or weren't mature enough when we started)
- Automate simple but widely used tasks; 1 or 2 PCs are enough to run jobs; the build process is a great system to start with
- Focus on usability
Q: How do you take advantage of multiprocessor machines? A: Farm code is multithreaded
Q: How many people oversee the farm? A: Just me; it takes a significant portion of my time
**Final** - Bungie would not have been able to ship Halo 3 at the same quality level without the farm in place. Studio iteration time and efficiency are key.