<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Electric Cloud Blog &#187; agent</title>
	<atom:link href="http://www.electric-cloud.com/blog/tag/agent/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.electric-cloud.com/blog</link>
	<description>This is your source for private development cloud best practices and technical tips and tricks for Electric Cloud solutions</description>
	<lastBuildDate>Thu, 02 Feb 2012 22:32:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
		<item>
		<title>Measuring ElectricAccelerator Cache Efficiency</title>
		<link>http://www.electric-cloud.com/blog/2009/03/11/measuring-electricaccelerator-cache-efficiency/</link>
		<comments>http://www.electric-cloud.com/blog/2009/03/11/measuring-electricaccelerator-cache-efficiency/#comments</comments>
		<pubDate>Wed, 11 Mar 2009 21:26:57 +0000</pubDate>
		<dc:creator>Eric Melski</dc:creator>
				<category><![CDATA[Electric Cloud Solutions]]></category>
		<category><![CDATA[agent]]></category>
		<category><![CDATA[ClearCase]]></category>
		<category><![CDATA[ElectricAccelerator]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://blog.electric-cloud.com/?p=249</guid>
		<description><![CDATA[Somebody asked me the other day, &#8220;How much does the ElectricAccelerator filesystem cache reduce I/O load on my build host?&#8221; This is an interesting question, because in some cases, the impact of Accelerator caching is a big part of the performance benefit. Consider the case of ClearCase dynamic views, which have notoriously bad performance, particularly [...]]]></description>
			<content:encoded><![CDATA[<p>Somebody asked me the other day, &#8220;How much does the ElectricAccelerator filesystem cache reduce I/O load on my build host?&#8221;  This is an interesting question, because in some cases, the impact of Accelerator caching is a big part of the performance benefit.  Consider the case of ClearCase dynamic views, which have <a href="http://www.perforce.com/perforce/comparisons/perforce_clearCase.pdf">notoriously</a> <a href="http://en.wikipedia.org/wiki/Rational_ClearCase#Weaknesses">bad</a> <a href="http://www.cmi.com/InfoCenter/Whitepapers/clearcase_bpd_summary.pdf">performance</a>, particularly for stat() operations.  By reducing the number of times the build accesses the host filesystem, Accelerator can provide a substantial performance boost.  In one extreme case, I saw a build that ran 50 times faster just by using a single Accelerator agent, because the host filesystem was so slow.  In this post, I&#8217;ll show you how to determine how much Accelerator caching is doing for your build.<br />
<span id="more-249"></span></p>
<p>
To evaluate Accelerator cache efficiency, we need to compare the total number of file accesses performed by the build with the number that actually end up hitting the host filesystem.  The difference between these two values tells us how many accesses were served from the cache; in turn, the ratio of this difference to the total gives us the cache hit rate.  For this comparison, we&#8217;ll look at the following specific types of filesystem access:
</p>
<ul>
<li>The number of readdir() operations.
</li>
<li>The number of stat() operations.
</li>
<li>The amount of data read from disk.
</li>
<li>The amount of data written to disk.
</li>
</ul>
<h3>Counting total file accesses</h3>
<p>We can get the total number of accesses from <i>agent performance metrics</i>.  We&#8217;ve looked at these <a href="http://ecloud.wordpress.com/2008/10/17/digging-into-accelerator-agent-metrics-part-1/">once</a> or <a href="http://blog.electric-cloud.com/2008/11/10/electricaccelerator-agent-metrics-part-2/">twice</a> before.  Follow the directions <a href="http://ecloud.wordpress.com/2008/10/17/digging-into-accelerator-agent-metrics-part-1/">here</a> to obtain and aggregate metrics from the agents (make sure to get the latest version of the <a href="http://community.electric-cloud.com/download/attachments/4456525/agentsummary">agentsummary</a> script).  Once you have that summary data, the <code>Directory scans</code> &#8220;to EFS&#8221; metric in the <code>Caching</code> section gives you the total number of readdir() operations performed by the build:
</p>
<div style="background:#dee7f7;border:dashed thin;width:80ex">
<pre>
Caching:
  ...
  Directory scans:  15.0% (40 to EFS, 6 to emake)
  ...
</pre>
</div>
<p>
Next, you need the <i>raw</i> count of <code>Lookup</code> records from the <code>Usage records</code> section; this is the total number of stat() operations performed by the build:
</p>
<div style="background:#dee7f7;border:dashed thin;width:80ex">
<pre>
Usage records:
  ...
  Lookup              4029 ( 12.7%),   21.3 per job;   125881 raw (  3.2%)
</pre>
</div>
<p>
Finally, you need the <code>Total</code> MB values for <code>EFS disk reads</code> and <code>EFS disk writes</code> from the <code>Bandwidth</code> section; these give us the total amount of data read and written by the build:
</p>
<div style="background:#dee7f7;border:dashed thin;width:80ex">
<pre>
Bandwidth:
  ...
  EFS disk reads
    Locked:               295.4 MB, 223.6 MB/s active,   7.4 MB/s overall
    Unlocked:               0.0 MB,   0.0 MB/s active,   0.0 MB/s overall
    Total:                295.4 MB, 223.6 MB/s active,   7.4 MB/s overall
  EFS disk writes
    Locked:               218.9 MB, 280.6 MB/s active,   5.5 MB/s overall
    Unlocked:               0.0 MB,   0.0 MB/s active,   0.0 MB/s overall
    Total:                218.9 MB, 280.6 MB/s active,   5.5 MB/s overall
  ...
</pre>
</div>
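<p>If you collect these numbers regularly, you may prefer to pull them out of the summary with a script.  The following is a minimal Python sketch, not part of Accelerator or <code>agentsummary</code>: the function name <code>parse_usage_line</code> is hypothetical, and the pattern is inferred only from the sample <code>Lookup</code> line shown above, so other versions of the output may need a different expression.</p>

```python
import re

# Pattern inferred from the sample Usage records line; an assumption, not a documented format.
USAGE_LINE = re.compile(
    r"^\s*(.+?)\s+(\d+) \(\s*([\d.]+)%\),\s+([\d.]+) per job;\s+(\d+) raw \(\s*([\d.]+)%\)"
)

def parse_usage_line(line):
    """Return a dict for one Usage records line, or None if it does not match."""
    match = USAGE_LINE.match(line)
    if match is None:
        return None
    name, reported, share, per_job, raw, kept = match.groups()
    return {
        "name": name,
        "reported": int(reported),
        "share_percent": float(share),
        "per_job": float(per_job),
        "raw": int(raw),
        "raw_kept_percent": float(kept),
    }

sample = "  Lookup              4029 ( 12.7%),   21.3 per job;   125881 raw (  3.2%)"
print(parse_usage_line(sample)["raw"])  # prints 125881
```

<p>The <code>raw</code> field is the value to use as the total number of stat() operations.</p>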
<h3>Counting host filesystem accesses</h3>
<p>Now we need to find out how much of the I/O activity actually hit the host filesystem &#8212; all the accesses that were not serviced by the cache.  You&#8217;ll find this data in the <i>emake performance metrics</i>, which we&#8217;ve looked at a bit <a href="http://ecloud.wordpress.com/2009/01/07/electricmake-temporary-directory-settings/">previously</a>.  To obtain these metrics, you need to enable emake performance logging by adding <code>--emake-debug=g --emake-logfile=emake.dlog</code> to your emake command-line options.  When your build completes, you&#8217;ll find the metrics in the file emake.dlog.  First, we need the <code>DirCache readdirs</code> and <code>DirCache stats</code> values in the <code>Counter values</code> section; these give us the number of readdir() and stat() operations that hit the host filesystem, respectively:
</p>
<div style="background:#dee7f7;border:dashed thin;width:80ex">
<pre>
Counter values:
  ...
  DirCache readdirs         10
  DirCache stats           890
  ...
</pre>
</div>
<p>
Next, we need the <code>To disk</code> and <code>From disk</code> data from the <code>Bandwidth</code> section; these tell us the amount of data written to and read from the host filesystem:
</p>
<div style="background:#dee7f7;border:dashed thin;width:80ex">
<pre>
Bandwidth:
 ...
 To disk:             167.6 MB,  28.6 MB/s active, 4.1 MB/s overall
 From disk:            29.1 MB,   6.2 MB/s active, 0.7 MB/s overall
 ...
</pre>
</div>
<h3>Computing cache efficiency</h3>
<p>Now we can put all the numbers together and compute the cache efficiency:
</p>
<table rules="rows columns" border cellpadding="8">
<tr style="background:#e0e0e0">
<th align="left">Metric</th>
<th align="right">Total accesses</th>
<th align="right">Host FS accesses</th>
<th align="right">Cache hit rate</th>
</tr>
<tr style="background:#ffffce">
<td align="left">readdir() operations</td>
<td align="right">40</td>
<td align="right">10</td>
<td align="right">75%</td>
</tr>
<tr style="background:#deffde">
<td align="left">stat() operations</td>
<td align="right">125,881</td>
<td align="right">890</td>
<td align="right">99%</td>
</tr>
<tr style="background:#ffffce">
<td align="left">MB read from disk</td>
<td align="right">295</td>
<td align="right">29</td>
<td align="right">90%</td>
</tr>
<tr style="background:#deffde">
<td align="left">MB written to disk</td>
<td align="right">219</td>
<td align="right">168</td>
<td align="right">23%</td>
</tr>
</table>
<p>
The effect is dramatic, even on the small build I used for this example.  This is why some people call emake a &#8220;ClearCase accelerator&#8221;.  Of course, every build will have a slightly different profile, but in general you ought to see similar results.</p>
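<p>The hit-rate calculation behind the table is just the difference between total and host accesses, divided by the total.  Here is a small Python sketch using the values from the metrics above; the function name <code>cache_hit_rate</code> is my own choice for illustration.</p>

```python
def cache_hit_rate(total_accesses, host_accesses):
    """Percentage of accesses served from the cache instead of the host filesystem."""
    served_from_cache = total_accesses - host_accesses
    return 100.0 * served_from_cache / total_accesses

# (total from the agent metrics, host filesystem accesses from the emake metrics)
measurements = {
    "readdir() operations": (40, 10),
    "stat() operations": (125881, 890),
    "MB read from disk": (295.4, 29.1),
    "MB written to disk": (218.9, 167.6),
}

for metric, (total, host) in measurements.items():
    print(f"{metric}: {cache_hit_rate(total, host):.0f}%")
```

<p>Rounded to whole percentages, the results agree with the table: 75%, 99%, 90% and 23%.</p>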
]]></content:encoded>
			<wfw:commentRss>http://www.electric-cloud.com/blog/2009/03/11/measuring-electricaccelerator-cache-efficiency/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>ElectricAccelerator Agent Metrics, part 2</title>
		<link>http://www.electric-cloud.com/blog/2008/11/10/electricaccelerator-agent-metrics-part-2/</link>
		<comments>http://www.electric-cloud.com/blog/2008/11/10/electricaccelerator-agent-metrics-part-2/#comments</comments>
		<pubDate>Mon, 10 Nov 2008 19:48:51 +0000</pubDate>
		<dc:creator>Eric Melski</dc:creator>
				<category><![CDATA[Electric Cloud Solutions]]></category>
		<category><![CDATA[agent]]></category>
		<category><![CDATA[ElectricAccelerator]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[usage]]></category>

		<guid isPermaLink="false">http://ecloud.wordpress.com/?p=72</guid>
		<description><![CDATA[ElectricAccelerator agent metrics provide a tremendous amount of data that you can use to analyze and improve the performance of your builds. Last time we saw how to collect the metrics and we explored the data presented in the Overall time usage section. This time, we&#8217;ll look at the data in the Usage records section. [...]]]></description>
			<content:encoded><![CDATA[<p>ElectricAccelerator agent metrics provide a tremendous amount of data that you can use to analyze and improve the performance of your builds.  <a href="http://ecloud.wordpress.com/2008/10/17/digging-into-accelerator-agent-metrics-part-1/">Last time</a> we saw how to collect the metrics and we explored the data presented in the <strong>Overall time usage</strong> section.  This time, we&#8217;ll look at the data in the <strong>Usage records</strong> section.</p>
<p><span id="more-72"></span></p>
<h3>Usage records</h3>
<p>As a build is executed by Accelerator, the Electric File System tracks all filesystem and registry accesses made by commands in the build.  This <em>usage log</em> is the means by which modifications made on cluster nodes are propagated back to the emake host, and it also forms the basis for our conflict detection algorithm.  Because it is such a core piece of the system, we track numerous metrics specifically related to the usage log.  Those metrics are reported in the <strong>Usage records</strong> section of the agent performance metrics.</p>
<p>After your build completes, use <code>cmtool</code> and <code>agentsummary</code> to get the agent metrics, then find the <strong>Usage records</strong> section.  It looks something like this:</p>
<div style="border:thin dashed;background:#dee7f7 none repeat scroll 0 50%;width:80ex">
<pre>Usage records:
  Append                 0 (  0.0%),    0.0 per job;       25 raw (  0.0%)
  Blind create           0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Blind truncate         0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Create              4835 (  2.5%),    3.7 per job;     5425 raw ( 89.1%)
  Create key             0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Create(dir)          380 (  0.2%),    0.3 per job;      476 raw ( 79.8%)
  Delete key             0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Delete value           0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Failed lookup     112444 ( 59.3%),   85.8 per job;        0 raw (  0.0%)
  Link                 760 (  0.4%),    0.6 per job;        0 raw (  0.0%)
  Lookup             19715 ( 10.4%),   15.0 per job;  7507140 raw (  0.3%)
  Lookup key          1426 (  0.8%),    1.1 per job;     8385 raw ( 17.0%)
  Modify                33 (  0.0%),    0.0 per job;       42 raw ( 78.6%)
  Modify atts            1 (  0.0%),    0.0 per job;     1714 raw (  0.1%)
  New name            4080 (  2.2%),    3.1 per job;   116978 raw (  3.5%)
  Read               43476 ( 22.9%),   33.2 per job;    85315 raw ( 51.0%)
  Read key               0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Read value          2274 (  1.2%),    1.7 per job;     2274 raw (100.0%)
  Rename                 0 (  0.0%),    0.0 per job;       11 raw (  0.0%)
  Rename(dir)            0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Set value              0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Set value if           0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Truncate               7 (  0.0%),    0.0 per job;       32 raw ( 21.9%)
  Unlink                 0 (  0.0%),    0.0 per job;        0 raw (  0.0%)
  Unlink(last)         225 (  0.1%),    0.2 per job;      910 raw ( 24.7%)
  Total             189656 (100.0%),  144.7 per job;  7728727 raw (100.0%)</pre>
</div>
<p>As you can see, Accelerator tracks a wide variety of file and registry operations:</p>
<table border="1" rules="rows">
<tbody>
<tr>
<th align="left">Append</th>
<td>Adding data to the end of an existing file.</td>
</tr>
<tr>
<th align="left">Blind create</th>
<td>Creating a new file.</td>
</tr>
<tr>
<th align="left">Blind truncate</th>
<td>Completely overwriting an existing file.</td>
</tr>
<tr>
<th align="left">Create</th>
<td>Creating a new file.</td>
</tr>
<tr>
<th align="left">Create key</th>
<td>Creating a registry key (Windows only).</td>
</tr>
<tr>
<th align="left">Create(dir)</th>
<td>Creating a directory.</td>
</tr>
<tr>
<th align="left">Delete key</th>
<td>Removing a registry key (Windows only).</td>
</tr>
<tr>
<th align="left">Delete value</th>
<td>Removing a registry value (Windows only).</td>
</tr>
<tr>
<th align="left">Failed lookup</th>
<td>stat() on a non-existent file.</td>
</tr>
<tr>
<th align="left">Link</th>
<td>Creating a hard link to an existing file.</td>
</tr>
<tr>
<th align="left">Lookup</th>
<td>stat() on an existing file.</td>
</tr>
<tr>
<th align="left">Lookup key</th>
<td>Look up (or attempt to look up) a registry key (Windows only).</td>
</tr>
<tr>
<th align="left">Modify</th>
<td>Rewrite a portion of the contents of an existing file.</td>
</tr>
<tr>
<th align="left">Modify atts</th>
<td>Change the attributes of a file, such as access permissions.</td>
</tr>
<tr>
<th align="left">New name</th>
<td>Does not correspond to a specific operation; this is a placeholder used to facilitate reporting usage to emake.</td>
</tr>
<tr>
<th align="left">Read</th>
<td>Read the contents of a file.</td>
</tr>
<tr>
<th align="left">Read value</th>
<td>Read (or attempt to read) the contents of a registry value (Windows only).</td>
</tr>
<tr>
<th align="left">Rename</th>
<td>Rename a file.</td>
</tr>
<tr>
<th align="left">Rename(dir)</th>
<td>Rename a directory.</td>
</tr>
<tr>
<th align="left">Set value</th>
<td>Set a registry value (Windows only).</td>
</tr>
<tr>
<th align="left">Set value if</th>
<td>Set a registry value, with additional constraints on the value (Windows only).</td>
</tr>
<tr>
<th align="left">Truncate</th>
<td>Overwrite the contents of an existing file.</td>
</tr>
<tr>
<th align="left">Unlink</th>
<td>Remove a hard link to a file or directory.</td>
</tr>
<tr>
<th align="left">Unlink(last)</th>
<td>Delete a file or directory.</td>
</tr>
</tbody>
</table>
<p>For each type of operation, the agent reports the following data:</p>
<ul>
<li>Total number of records of that type reported to emake.</li>
<li>Portion of the total usage records reported represented by that type.</li>
<li>Average number of records of that type per job in the build.</li>
<li>Total number of raw records of that type reported by the EFS.</li>
<li>Portion of the raw records of that type reported to emake by the agent.</li>
</ul>
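<p>To make the relationship between these columns concrete, here is a rough Python sketch that recomputes the derived values for the <strong>Lookup</strong> row of the sample.  The function name is hypothetical, and the job count of 1,311 is an estimate I inferred by dividing the reported total by the per-job average; it is not a value printed in the metrics.</p>

```python
def usage_columns(reported, raw, total_reported, job_count):
    """Recompute the derived columns for one row of the Usage records section."""
    return {
        "share_of_reported_percent": 100.0 * reported / total_reported,
        "per_job": reported / job_count,
        # Rows with no raw records (such as synthesized ones) are shown as 0.0%.
        "raw_kept_percent": 100.0 * reported / raw if raw else 0.0,
    }

# Lookup row from the sample; job_count is an estimate, not a reported value.
lookup = usage_columns(reported=19715, raw=7507140, total_reported=189656, job_count=1311)
print({name: round(value, 1) for name, value in lookup.items()})
```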
<h3>Why are there so many types of operations?</h3>
<p>As of Accelerator 4.3.0, the EFS and agent track 25 different types of operations, almost double the variety that we tracked in Accelerator 3.5.0.  Many of the different types are fairly subtle variations on one another.  For example, the distinction between <strong>Create</strong> and <strong>Blind create</strong> is simply the flags used in the system call that created the file.  Similarly, <strong>Create</strong> could be considered a special case of <strong>Modify</strong>.  The question inevitably arises:  wouldn&#8217;t it be simpler to track fewer types of operations?</p>
<p>The answer of course is yes, it would be simpler, but that simplicity would come at the cost of efficiency.  For example, although <strong>Create</strong> is a special case of <strong>Modify</strong> usage, it has one very significant difference:  Modify usage implies that a command used the previous contents of the file before making changes, but Create implies that the command did not use the previous contents.  This seemingly trivial distinction is the difference between commands that must be serialized and commands that can run in parallel:  two commands that modify the same file must be serialized (think of compile steps that each update a shared .PDB file), while two commands that each overwrite the same file can be run in parallel.</p>
<h3>Reported Usage versus Raw Usage</h3>
<p>You may have noted that there is sometimes a significant difference between the raw usage reported by the EFS to the agent, and the usage ultimately reported by the agent to emake.  The simple explanation is that the agent has some smarts to eliminate redundant usage records, in order to minimize the number of records emake must manage and examine for conflicts.</p>
<p>A trivial example is eliminating redundant <strong>Read</strong> operations from the usage:  the first time a file is read during a given job, the agent logs the usage.  The second time the file is read during the same job, the agent discards the usage because it is redundant with the usage already logged for the file.</p>
<p>A more sophisticated example is eliminating <strong>Read</strong> operations on new files:  when a file is created during a job, the agent logs the usage.  If the file is later read during the same job, the agent discards the usage &#8212; the read cannot impact the dependency analysis that emake will perform, since it is reading data that was generated in the same job, rather than data generated during a different job in the build.</p>
<p>By volume, the most significant application of this technique is in the reduction of lookup operations.  It&#8217;s astonishing just how many lookup operations occur in the course of an average build, and even more so how many of them are redundant.  In the sample shown above, only a tiny fraction of the lookups (0.3% of the raw records) were actually significant.  The rest were discarded by the agent before ever being sent to emake, dramatically reducing the amount of data emake must manage.  If anything, the sample shown here is on the low end in terms of the number of lookups generated &#8212; most builds generate far more.  One customer&#8217;s build generated about 200 million lookups during a 50-minute build.</p>
<p>Some types of usage, such as failed lookups, are not actually directly reported by the EFS.  Instead, these types of records are synthesized by the agent to replace longer sequences of operations &#8212; again, in order to reduce the amount of data that emake must track.</p>
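<p>The two <strong>Read</strong> examples described earlier in this section can be sketched in a few lines of Python.  This is only an illustration of those two rules under my own simplified record format (an ordered list of operation and path pairs for a single job); the real agent handles many more cases, and the name <code>reduce_usage</code> is hypothetical.</p>

```python
def reduce_usage(raw_records):
    """Drop Read records that are redundant within a single job.

    raw_records is an ordered list of (operation, path) pairs for one job.
    """
    created = set()       # files created earlier in this job
    already_read = set()  # files whose Read has already been logged
    reported = []
    for operation, path in raw_records:
        if operation == "Read":
            if path in created or path in already_read:
                continue  # redundant: cannot affect dependency analysis
            already_read.add(path)
        elif operation == "Create":
            created.add(path)
        reported.append((operation, path))
    return reported

raw = [
    ("Read", "a.h"),
    ("Read", "a.h"),    # second read of the same file in this job
    ("Create", "a.o"),
    ("Read", "a.o"),    # read of a file created in this job
    ("Read", "b.h"),
]
print(reduce_usage(raw))
```

<p>Of the five raw records, only three survive: the first read of <code>a.h</code>, the creation of <code>a.o</code>, and the read of <code>b.h</code>.</p>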
<h3>What&#8217;s normal?</h3>
<p>The usage profile of most builds is surprisingly similar:  the vast majority of usage is lookups (both failed and successful); reads are the next most common, followed by creates.  Just about everything else occurs so infrequently that it&#8217;s not worth calling out explicitly.  To put concrete values on it:</p>
<table border="1" cellpadding="4" rules="rows">
<tbody>
<tr>
<th align="left">Failed Lookup</th>
<td>50% &#8211; 60%</td>
</tr>
<tr>
<th align="left">Read</th>
<td>25% &#8211; 30%</td>
</tr>
<tr>
<th align="left">Lookup</th>
<td>10% &#8211; 15%</td>
</tr>
<tr>
<th align="left">Create</th>
<td>2% &#8211; 10% (NB &#8212; usually 1 or 2 per job in the build)</td>
</tr>
<tr>
<th align="left">Everything else</th>
<td>2% &#8211; 5%</td>
</tr>
</tbody>
</table>
<h3>How can I use this data?</h3>
<p>There are two ways you can use the usage metrics to improve build performance.  First, you can use the usage profile as a simple &#8220;gut check&#8221; to spot anomalous behavior.  For example, a build that logs significantly more <strong>Create</strong> usage than others may be doing unnecessary additional work, such as building targets repeatedly.  A build that logs excessive <strong>Read</strong> usage may indicate problems like badly factored header files which cause every compile to read every header.  It&#8217;s hard to predict exactly what the problems might be; the key is to keep your eyes open for anything out of the ordinary.</p>
<p>The second consideration relates specifically to failed lookup usage.  As you&#8217;ve seen, failed lookups account for the majority of all usage.  Earlier I noted that failed lookups correspond to stat()-like operations on non-existent files, but what does that really mean?</p>
<p>The most common source of failed lookups is include path searches.  Suppose your build specifies an include search path to the compiler, for example:</p>
<div style="border:thin dashed;background:#dee7f7 none repeat scroll 0 50%;width:80ex">
<pre>... -I/tools/megawidgets/include -I/tools/apache/include \
  -I/tools/expat/include -I/tools/tcl/include -I/tools/perl/include ...</pre>
</div>
<p>Now, imagine your source files contain #include directives like this:</p>
<div style="border:thin dashed;background:#dee7f7 none repeat scroll 0 50%;width:80ex">
<pre>#include "math.h"</pre>
</div>
<p>When the compiler looks for the file <code>math.h</code>, it searches each directory in the include search path in order until it finds the file.  Of course, the file exists in only one directory, so every directory searched before that one generates failed lookup usage.  Now imagine a system in which the include search path contains dozens of directories.  Every compile step in the build may search nearly all of those directories for each header file included.  You can see how quickly it adds up:</p>
<div style="border:thin dashed;background:#dee7f7 none repeat scroll 0 50%;width:80ex">
<pre>      500 compiles
  *    20 headers per compile
  *    50 dirs in include path
  ----------------------------
  500,000 failed lookups</pre>
</div>
<p>Depending on the build, there may be nothing you can do about this.  But if you are able to, you could investigate optimizing the include search path:</p>
<ul>
<li>Eliminate stale entries.</li>
<li>Reorder the include path so that more commonly used headers are found earlier.</li>
<li>Customize the include path per target.  Some builds use a single include path for all compile steps, but if the <code>megawidgets</code> headers (for example) are used only by some targets, adding that directory to the include path for the other targets only bloats the search space.</li>
</ul>
<p>Unfortunately, it&#8217;s hard to predict exactly how much impact changes like these could have on your build, although I can say that my experience has consistently shown that reducing the size of your build in any dimension often results in surprising performance benefits due to second-order effects.  I encourage you to go for the &#8220;low hanging fruit&#8221; &#8212; make those changes that are easy to make and measure the impact.  If the modifications pay off, find the next lowest hanging fruit and repeat the process.  Keep going as long as the cost of making the changes is less than the benefit you see.</p>
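<p>To get a feel for the effect of reordering, here is a toy Python model that counts failed lookups for a given include path.  It assumes the compiler probes directories in order and stops at the first one that contains the header; the header names and their locations are made up for the example, and the model ignores everything else a real compiler does.</p>

```python
def failed_lookups(include_path, header_locations, includes):
    """Count the directories probed unsuccessfully before each header is found."""
    failures = 0
    for header in includes:
        home = header_locations[header]
        # Every directory ahead of the one holding the header is a failed lookup.
        failures += include_path.index(home)
    return failures

original = [
    "/tools/megawidgets/include",
    "/tools/apache/include",
    "/tools/expat/include",
    "/tools/tcl/include",
    "/tools/perl/include",
]
# Hypothetical header locations, for illustration only.
locations = {"tcl.h": "/tools/tcl/include", "expat.h": "/tools/expat/include"}
includes = ["tcl.h", "tcl.h", "tcl.h", "expat.h"]

# Move the most frequently used directory to the front of the path.
reordered = ["/tools/tcl/include"] + [d for d in original if d != "/tools/tcl/include"]

print(failed_lookups(original, locations, includes))   # prints 11
print(failed_lookups(reordered, locations, includes))  # prints 3
```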
]]></content:encoded>
			<wfw:commentRss>http://www.electric-cloud.com/blog/2008/11/10/electricaccelerator-agent-metrics-part-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ElectricAccelerator Agent Metrics, part 1</title>
		<link>http://www.electric-cloud.com/blog/2008/10/17/digging-into-accelerator-agent-metrics-part-1/</link>
		<comments>http://www.electric-cloud.com/blog/2008/10/17/digging-into-accelerator-agent-metrics-part-1/#comments</comments>
		<pubDate>Fri, 17 Oct 2008 21:26:36 +0000</pubDate>
		<dc:creator>Eric Melski</dc:creator>
				<category><![CDATA[Electric Cloud Solutions]]></category>
		<category><![CDATA[agent]]></category>
		<category><![CDATA[ElectricAccelerator]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://ecloud.wordpress.com/?p=10</guid>
		<description><![CDATA[Welcome to the Electric Cloud Blog! My name is Eric Melski, and I&#8217;m a Senior Software Engineer at Electric Cloud (employee #1, by a whisker). I&#8217;ve worked on every component of ElectricAccelerator, and I&#8217;m now Technical Lead for the product. I&#8217;m also the guy behind ElectricInsight. In this and future posts, I will take you [...]]]></description>
			<content:encoded><![CDATA[<p>Welcome to the Electric Cloud Blog!  My name is Eric Melski, and I&#8217;m a Senior Software Engineer at Electric Cloud (employee #1, by a whisker).  I&#8217;ve worked on every component of ElectricAccelerator, and I&#8217;m now Technical Lead for the product.  I&#8217;m also the guy behind ElectricInsight.  In this and future posts, I will take you on deep technical dives into Accelerator and Insight, with a focus on performance, scalability and analysis.</p>
<p>Understanding the performance of parallel, distributed systems can be difficult.  Fortunately, Accelerator provides a wealth of data to facilitate performance analysis.  In this first continuation of my presentation at the 2008 Customer Summit, I&#8217;ll show you how to collect <em>agent performance metrics</em>, and how you can use them to get a quick overview of the performance characteristics of your build.</p>
<p><span id="more-10"></span></p>
<h3>Collecting Agent Metrics</h3>
<p>Every agent in the cluster automatically tracks performance metrics for every build that it participates in, so you don&#8217;t have to do anything special to enable them.  All you need to do is download them from the agents after your build has finished.  Note that by default, the agent keeps performance data for the previous 100 builds that it participated in, so you don&#8217;t have to grab the metrics immediately after your build completes.</p>
<p>The easiest way to get the metrics from the cluster is via <code>cmtool</code>:</p>
<div style="background: none repeat scroll 0% 0% #deffde;border: thin dashed;width: 80ex">
<pre>cmtool --cm=... runAgentCmd "session performance <em>buildId</em>" &gt; <em>buildId</em>.agentraw
</pre>
</div>
<p>This will write a dump of performance metrics from each agent that participated in the specified build to the file named <em>buildId</em>.agentraw.  Next, you&#8217;ll want to use the (unsupported) <a href="https://electriccloud.zendesk.com/attachments/token/i42zqwp0ccngad6/?name=agentsummary"><code>agentsummary</code></a> utility to produce an aggregate summary of the metrics across all agents:</p>
<div style="background: none repeat scroll 0% 0% #deffde;border: thin dashed;width: 80ex">
<pre>tclsh agentsummary <em>buildId.agentraw</em> &gt; <em>buildId.agentsum</em>
</pre>
</div>
<p>Now we have something that we can dig into.</p>
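<p>If you collect metrics often, the two steps above could be wrapped in a small script.  The following Python sketch simply assembles the same two command lines shown above and redirects their output to files; the function names are hypothetical, the <code>cm</code> argument stands for whatever value you normally pass to <code>--cm</code>, and it assumes <code>cmtool</code>, <code>tclsh</code> and the <code>agentsummary</code> script are reachable from the current directory.</p>

```python
import subprocess

def build_commands(cm, build_id):
    """Return (argv, output_file) pairs for the two collection steps."""
    raw_file = f"{build_id}.agentraw"
    summary_file = f"{build_id}.agentsum"
    fetch = ["cmtool", f"--cm={cm}", "runAgentCmd", f"session performance {build_id}"]
    summarize = ["tclsh", "agentsummary", raw_file]
    return [(fetch, raw_file), (summarize, summary_file)]

def collect_agent_metrics(cm, build_id):
    """Run both steps, writing each command's standard output to its file."""
    for argv, output_file in build_commands(cm, build_id):
        with open(output_file, "w") as out:
            subprocess.run(argv, stdout=out, check=True)
```

<p>Keeping the command construction separate from the execution makes it easy to print the commands first and confirm they match what you would type by hand.</p>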
<h3>Overall time usage</h3>
<p>The first thing to look at in the agent performance metrics is the &#8220;Overall time usage&#8221; section.  This divides the time used by the agents into several coarse-grained buckets:</p>
<div style="background: none repeat scroll 0% 0% #dee7f7;border: thin dashed;width: 80ex">
<pre>Overall time usage:
  Startup:           4.65s ( 0.0%)       20 intvls, avg   232.5ms
  Cmd setup:       253.15s ( 0.4%)    42281 intvls, avg     6.0ms
  Command:       36479.41s (61.7%)    82823 intvls, avg   440.5ms
  Emake request:  7732.26s (13.1%)   763879 intvls, avg    10.1ms
  A2A request:    1968.67s ( 3.3%)    52112 intvls, avg    37.8ms
  Command end:     448.59s ( 0.8%)    82823 intvls, avg     5.4ms
  Return:         4874.27s ( 8.2%)    42281 intvls, avg   115.3ms
  Idle:           6726.62s (11.4%)    42261 intvls, avg   159.2ms
  End:             671.13s ( 1.1%)
</pre>
</div>
<p>These timers correspond to the following activities:</p>
<table border="1" rules="rows">
<tbody>
<tr>
<th align="left">Startup</th>
<td>Waiting for the first commands from emake after the connection is established.</td>
</tr>
<tr>
<th align="left">Cmd setup</th>
<td>Preparing the execution context (environment, working directory, etc.) for a build command</td>
</tr>
<tr>
<th align="left">Ver updates</th>
<td>Processing file version updates from emake</td>
</tr>
<tr>
<th align="left">Command</th>
<td>Running build commands (compiles, links, etc)</td>
</tr>
<tr>
<th align="left">Emake request</th>
<td>Waiting for emake to service a request for file data or metadata</td>
</tr>
<tr>
<th align="left">A2A request</th>
<td>Waiting for file data transferred from other agents via P2P</td>
</tr>
<tr>
<th align="left">Command end</th>
<td>Waiting for instructions from emake after finishing execution of a build command.</td>
</tr>
<tr>
<th align="left">Return</th>
<td>Reporting file usage to emake, including the contents of any files created</td>
</tr>
<tr>
<th align="left">Idle</th>
<td>Waiting for emake to provide more commands to run</td>
</tr>
<tr>
<th align="left">End</th>
<td>A special case of Idle time, this is the amount of time between the last build command and the end of the agent&#8217;s participation in the build</td>
</tr>
</tbody>
</table>
<p>For each timer, the metrics show:</p>
<ul>
<li>The total time spent in the timer across all agents;</li>
<li>The portion of time spent in the timer as a percentage of the total time across all agents;</li>
<li>The number of times the timer was active;</li>
<li>The average amount of time the timer was active each time it was in use.</li>
</ul>
<p>Note that not all timers are available in all versions of Accelerator.</p>
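<p>The percentage and average columns can be recomputed from the totals and interval counts, which is a handy sanity check when comparing builds.  This Python sketch uses the sample values shown above; the function name <code>summarize</code> is hypothetical, and <code>None</code> marks the <strong>End</strong> timer, which has no interval count in the sample.</p>

```python
# timer name: (total seconds across all agents, number of intervals)
SAMPLE = {
    "Startup": (4.65, 20),
    "Cmd setup": (253.15, 42281),
    "Command": (36479.41, 82823),
    "Emake request": (7732.26, 763879),
    "A2A request": (1968.67, 52112),
    "Command end": (448.59, 82823),
    "Return": (4874.27, 42281),
    "Idle": (6726.62, 42261),
    "End": (671.13, None),
}

def summarize(timers):
    """Return {name: (percent of total time, average ms per interval or None)}."""
    grand_total = sum(seconds for seconds, _ in timers.values())
    summary = {}
    for name, (seconds, intervals) in timers.items():
        share = 100.0 * seconds / grand_total
        average_ms = 1000.0 * seconds / intervals if intervals else None
        summary[name] = (share, average_ms)
    return summary

for name, (share, average_ms) in summarize(SAMPLE).items():
    average = "n/a" if average_ms is None else f"{average_ms:.1f}ms"
    print(f"{name:14s} {share:5.1f}%  avg {average}")
```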
<p>Of these timers, <strong>Command</strong> is perhaps the most interesting, because it represents the actual work performed during your build.  The other timers all correspond to inefficiency of some sort, whether due to the overhead introduced by running a distributed build across the network, or due to serializations in the architecture of the build itself.</p>
<p>One interesting factoid about the <strong>Command</strong> timer is that you can use its percentage to estimate the <em>X factor</em>, or speedup relative to a serial build.  Simply multiply the Command percentage by the number of agents that participated in your build.  For example, if the Command timer represents 65% of the total time, and your build used 8 agents, then the X factor is about 5.2x:</p>
<div style="background: none repeat scroll 0% 0% #dee7f7;border: thin dashed;width: 80ex">
<pre>0.65 * 8 == 5.2x better than serial
</pre>
</div>
<p>Two caveats about this value:  first, it&#8217;s only an estimate.  In my experience it seems to be reasonably accurate in the back-of-the-envelope sense, but it&#8217;s by no means perfect.  Second, it only really works if the agents are running on hardware comparable to the system you would have used for a serial build.  If the agents are slower or faster than the baseline host, that will skew the number correspondingly.</p>
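<p>For completeness, the estimate is easy to express as a helper.  This is just the arithmetic from above written as a Python function with a name of my own choosing; the same caveats apply to its result.</p>

```python
def estimate_x_factor(command_percent, agent_count):
    """Rough speedup relative to a serial build, from the Command timer percentage."""
    return command_percent / 100.0 * agent_count

# The example from the text: Command at 65% with 8 agents.
print(f"{estimate_x_factor(65, 8):.1f}x better than serial")
```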
<h3>Are my values &#8220;good&#8221;?</h3>
<p>Of course the whole point of looking at these timers is to determine if your build is performing well, or if there is room for improvement.  Therefore, we need to establish some guidelines for evaluating the values in these timers.  To that end, here are some very rough guides:</p>
<table border="1" cellpadding="4" rules="all">
<tbody>
<tr>
<td></td>
<th>Good</th>
<th>Acceptable</th>
<th>Warning!</th>
</tr>
<tr>
<th align="left">Command</th>
<td bgcolor="#deffde">&gt; 60%</td>
<td bgcolor="#dee7f7">50% &#8211; 60%</td>
<td bgcolor="#ffcfce">&lt; 50%</td>
</tr>
<tr>
<th align="left">Emake request</th>
<td bgcolor="#deffde">&lt; 10%</td>
<td bgcolor="#dee7f7">10% &#8211; 15%</td>
<td bgcolor="#ffcfce">&gt; 15%</td>
</tr>
<tr>
<th align="left">Idle/End</th>
<td bgcolor="#deffde">&lt; 10%</td>
<td bgcolor="#dee7f7">10% &#8211; 20%</td>
<td bgcolor="#ffcfce">&gt; 20%</td>
</tr>
<tr>
<th align="left">Return</th>
<td bgcolor="#deffde">&lt; 10%</td>
<td bgcolor="#dee7f7">10% &#8211; 20%</td>
<td bgcolor="#ffcfce">&gt; 20%</td>
</tr>
</tbody>
</table>
<p>Remember, <strong>Command</strong> time represents all the time spent doing the actual work in your build.  Ideally, we&#8217;d like to see that at 100%, but of course that target is unattainable &#8212; if nothing else, there is always overhead introduced simply because we have to move data across the network.  Still, it&#8217;s not uncommon to see builds with Command time representing 80% or more of the total time.  Anything over 60% is pretty good, really.  Less than 50% means there&#8217;s probably room for improvement.  That&#8217;s not a guarantee that you can do better, of course.  It may simply be the nature of the beast, or you may find that the cost of improving the performance outweighs the benefit.  If you do see a low percentage for Command time, the thing to do is look at the other timers to determine where the time is going instead.  That will in turn provide some guidance to the root cause of the problem.</p>
<p>For example, if the <strong>Emake request</strong> timer is high (greater than 15%), that indicates a problem with the communication between emake and the agents.  The things to check next in this case are:</p>
<ul>
<li>Network bandwidth between emake and the agents.</li>
<li>System load on the emake host.</li>
<li>Disk performance on the emake host.</li>
<li>Number of agents participating in the build (too many agents can cause emake to thrash).</li>
</ul>
<p>Another possibility is that the <strong>Idle</strong> and <strong>End</strong> timers are high (more than 20% combined).  This usually indicates that your build has significant serializations, or that it had a large number of conflicts.  Use ElectricInsight to investigate further.</p>
<p>Finally, if the <strong>Return</strong> timer is high (greater than 20%), you&#8217;ll want to check:</p>
<ul>
<li>Network bandwidth between emake and the agents.</li>
<li>Disk performance on the emake host.</li>
<li>Disk performance on the agent host.</li>
</ul>
<p>It&#8217;s important to note that in general, any one metric will not exactly pinpoint a specific problem; rather, each metric should be thought of as a diagnostic tracer.  The more tracers you combine, the more precise your diagnosis will be.</p>
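<p>If you want to apply the rough guides from the table automatically, something like the following Python sketch would do.  The thresholds are copied from the table; the names <code>GUIDES</code> and <code>rate</code> are hypothetical, and treating values that fall exactly on a boundary as &#8220;Acceptable&#8221; is my own reading of the ranges.</p>

```python
# timer: (good limit, warning limit, True if a higher percentage is better)
GUIDES = {
    "Command": (60.0, 50.0, True),
    "Emake request": (10.0, 15.0, False),
    "Idle/End": (10.0, 20.0, False),
    "Return": (10.0, 20.0, False),
}

def rate(timer, percent):
    """Classify a timer percentage as Good, Acceptable or Warning!"""
    good_limit, warning_limit, higher_is_better = GUIDES[timer]
    if not higher_is_better:
        # Negate everything so the same comparisons work for both kinds of timer.
        percent, good_limit, warning_limit = -percent, -good_limit, -warning_limit
    if percent > good_limit:
        return "Good"
    if percent < warning_limit:
        return "Warning!"
    return "Acceptable"

# Values from the sample output; Idle and End are combined (11.4 + 1.1).
sample = {"Command": 61.7, "Emake request": 13.1, "Idle/End": 12.5, "Return": 8.2}
for timer, percent in sample.items():
    print(f"{timer}: {percent}% is {rate(timer, percent)}")
```

<p>A rating is only a starting point for investigation, in keeping with the advice above about combining several metrics.</p>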
<h3>Next time</h3>
<p>There is lots of other interesting data in the agent performance metrics which you can use to further analyze build performance.  In my next post, we&#8217;ll look at the <strong>Usage records</strong> section.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.electric-cloud.com/blog/2008/10/17/digging-into-accelerator-agent-metrics-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

