Troubleshooting a vCenter Database that is too big

I have a vCenter 5.1 database that has grown to almost 50GB in size.  Looking at the database sizing in the vSphere (C#) client (yes, I know I should be using the web client!) under the “Administration” menu, then “vCenter Server Settings”, then “Statistics”, the estimated size for the inventory was below 15GB (and this was with many more hosts and VMs than I am running!).

So the first thing to do here is check your statistics level: are you showing level 1 for all the fields?

[Screenshot: vCenter estimated database size]

The next thing to do is click the “Database Retention Policy” Settings page.  When you install vCenter the default settings will look like this:

[Screenshot: default vCenter retention policy]

You need to tick both of the boxes and put in a sensible number (I have blogged about this before).  Talking to VMware it seems most customers set this to 30 days.

An interesting fact I never knew before about this setting is that changing it does not retrospectively apply the new retention period to older entries in the database.  So if you ran with a setting of 365 days for a period of time, those records will still age out after 365 days even if you subsequently change the setting to 30 days.
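To make that concrete, here is a toy model of the behaviour described above. It is only a sketch of the arithmetic, not vCenter's actual cleanup logic; the function name `purge_date` and the dates are made up for illustration.

```python
from datetime import date, timedelta


def purge_date(written_on: date, retention_days_when_written: int) -> date:
    """Date a record ages out, assuming the retention period that was in
    force when the record was written keeps applying to it (assumption
    based on the behaviour described in the text)."""
    return written_on + timedelta(days=retention_days_when_written)


# Hypothetical record written while retention was 365 days.
old_event = purge_date(date(2013, 6, 1), 365)

# Hypothetical record written after retention was lowered to 30 days.
new_event = purge_date(date(2014, 1, 15), 30)

if __name__ == "__main__":
    print("old record ages out on", old_event.isoformat())
    print("new record ages out on", new_event.isoformat())
```

In this model the newer record is purged months before the older one, which is why lowering the setting does not shrink the event tables straight away.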

Now to check exactly where the space is going in your database.  The go-to technical document for this is 1028356.  This will show you how to check whether the space is going to the database (mdf) file or the log (ldf) file.  Again, most customers choose a “Simple Logging” setting (the Simple recovery model in SQL Server) for the vCenter database to ensure the log files stay small.

Later in the document is an excellent script you can execute to get the exact table sizes and it sorts by largest (in MB).

select object_name(id) [Table Name],
[Table Size] = convert (varchar, dpages * 8 / 1024) + 'MB'
from sysindexes where indid in (0,1)
order by dpages desc

In my case adding up all of the entries only came to 16GB (copy and paste the lot into Excel, search for MB and replace it with nothing to lose the MB from the numbers, and then sum all of the numbers).  So where is my space?  It is white space in the DB, and this will only be removed if a shrink of the database is run (see technical document 1036738 for details).

For me the top tables were VPX_EVENT_ARG and VPX_EVENT, indicating a lot of activity in events for servers.  These, of course, can be viewed from the “Tasks and Events” tab when a host/cluster/DC is selected in the vSphere Client, or you could select the top 1000 rows from SQL Management Studio (right click on the table you are interested in and then choose “Select Top 1000 Rows”).
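As an alternative to the Excel step, the summing can be sketched in a few lines of Python. This is a minimal sketch, assuming the query results were pasted as plain text with the size (for example `5120MB`) at the end of each line; the function name `total_table_size_mb` and the sample rows are hypothetical.

```python
import re

# Matches a trailing size such as "5120MB" at the end of a result line.
SIZE_PATTERN = re.compile(r"(\d+)\s*MB\s*$")


def total_table_size_mb(pasted_output: str) -> int:
    """Sum the 'Table Size' column of the pasted query results, in MB."""
    total = 0
    for line in pasted_output.splitlines():
        match = SIZE_PATTERN.search(line)
        if match:  # header and blank lines carry no size and are skipped
            total += int(match.group(1))
    return total


# Hypothetical sample rows for illustration only.
sample = """Table Name      Table Size
VPX_EVENT_ARG   9000MB
VPX_EVENT       5000MB
VPX_TASK        2384MB
"""

if __name__ == "__main__":
    total = total_table_size_mb(sample)
    print(f"{total} MB is roughly {total / 1024:.1f} GB")
```

Comparing that total with the size of the mdf file gives a rough idea of how much of the file is white space.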

I have two main reasons for a large number of events:

  1. We have a backup product that constantly logs into and out of the hosts (nothing we can do about this).
  2. We also have an issue on ESXi 5.1 hosts where an HP component called hp-ams is constantly logging into the host.  Note this is on ESXi 5.1 servers and I have not seen this on 5.5.  See my previous post discussing this here, which also includes a link to the HP advisory.

So for now I have a setting of 30 days retention on tasks and events.  VMware have provided a script to completely clear the tasks and events log and have advised this should be run prior to shrinking the database if this is what I wanted to do (I could just shrink and this would be okay for me).

I will provide the script details below but please note the following:

  • You run this at your own risk.
  • I would advise you to contact VMware support before contemplating running it.  They are very helpful people and will check this is the correct thing for you to do, and will also run it for you if you are not comfortable with it.
  • If you do not know how to recover from a problem/corrupt vCenter database, do not proceed.
  • I was given the script to run against a vCenter 5.1 database; I cannot confirm whether this will work for other versions.
  • I haven’t run the script myself (yet)!
  • You must shut down your vCenter components and back up your database (see here) before running the script.  Note that non-availability of vCenter may impact your backup systems and will affect your ability to manage the system.  When you have run the script, shrink the DB and then restart the VMware services.

This is the script…

MS SQL – VC 4.0 / VC 4.1 / VC 5.0

alter table VPX_EVENT_ARG drop constraint FK_VPX_EVENT_ARG_REF_EVENT, FK_VPX_EVENT_ARG_REF_ENTITY
alter table VPX_ENTITY_LAST_EVENT drop constraint FK_VPX_LAST_EVENT_EVENT

truncate table VPX_TASK
truncate table VPX_ENTITY_LAST_EVENT
truncate table VPX_EVENT
truncate table VPX_EVENT_ARG

alter table VPX_EVENT_ARG add
constraint FK_VPX_EVENT_ARG_REF_EVENT foreign key(EVENT_ID) references VPX_EVENT (EVENT_ID) on delete cascade,
constraint FK_VPX_EVENT_ARG_REF_ENTITY foreign key (OBJ_TYPE) references VPX_OBJECT_TYPE (ID)

alter table VPX_ENTITY_LAST_EVENT add
constraint FK_VPX_LAST_EVENT_EVENT foreign key(LAST_EVENT_ID) references VPX_EVENT (EVENT_ID) on delete cascade

This will be worked into a maintenance window for me so be warned I HAVEN’T RUN IT MYSELF YET!  I will update this post when I have completed this and also when I have resolved the pesky hp-ams issue (this may be via an upgrade to 5.5 or resolving in place on 5.1).

As always, comments are welcome!

C

 

 

 

Unable to connect to the MKS: Connection terminated by server on ESXi 5.5

I am seeing this issue on ESXi 5.5 build 1331820 (from the HP custom image) on HP BL460C GEN8 servers.  According to this thread it has been seen by a few people, and VMware released a patch for this on December 22nd (four patches in total were released on this day).  Rebooting the hosts or restarting the management agents clears the fault, but it appears to come back after a couple of weeks of operation.

The thread also discusses issues with an HP component called AMS but I have not seen an issue with this.  HP have an advisory out for AMS 9.1.0 but this is on ESXi 5.1 or 5.0, not 5.5.  My version of AMS is currently 550.9.4.0-29.1198611 (get this from a PuTTY session on the host by executing “esxcli software vib list | more”) so I am above the recommended version (9.2.0 or later) anyway.  The post does mention a last resort method of uninstalling the software, which can be found here.
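If you have a lot of hosts to check, the version comparison above can be sketched in Python. This is a rough sketch that assumes the VIB version string has the form `<ESXi release>.<AMS major>.<AMS minor>.<AMS patch>-<build info>`, which matches the string from my host but is an assumption rather than a documented format; the function names are hypothetical.

```python
def ams_version(vib_version: str) -> tuple:
    """Extract the AMS version from a VIB version string.

    Assumption: the first dotted field (e.g. 550) is the ESXi release
    and the next three fields are the AMS version itself.
    """
    main_part = vib_version.split("-", 1)[0]       # "550.9.4.0"
    fields = [int(part) for part in main_part.split(".")]
    return tuple(fields[1:4])                      # (9, 4, 0)


def meets_minimum(vib_version: str, minimum: tuple = (9, 2, 0)) -> bool:
    """True if the AMS version is at or above the recommended minimum."""
    return ams_version(vib_version) >= minimum


if __name__ == "__main__":
    installed = "550.9.4.0-29.1198611"  # version string quoted in the text
    print(ams_version(installed), meets_minimum(installed))
```

Tuples compare element by element, so 9.10.0 would correctly rank above 9.2.0, which a plain string comparison would get wrong.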

If you do need to update AMS this would be done from the latest HP offline bundle available here.

Other recent updates for the BL460C GEN8 are:

  • ILO4 version 1.32, which corrects an issue with ILO incorrectly identifying an overheating condition.  Download the latest version from the ILO4 support page.
  • A new system ROM dated 14th November (previous version was 18th September).  Download this from the HP product homepage.

The patches above are not in the latest version of the HP Service Pack for ProLiant (SPP) which is currently version 2013.09.0(B) and cuts off at 31st October 2013.  So if you prefer to deploy patches this way you will have to wait for the newest release.

I plan on testing the latest firmware and ESXi patches on production systems soon.  If you are suffering the same issue my advice would be to monitor the thread mentioned at the top of the post to see how other people are getting along 🙂

EDIT – some further updates to the communities thread state that the VMware December 22nd updates for ESXi do not fix the issue.  One poster states that only stopping and starting the hp-ams service allows normal operation to resume.  I have tested this in my own environment and can confirm this also.

To restart hp-ams you don't have to be in maintenance mode (but please do enter it if you are concerned about the risk of the host crashing).  Then PuTTY onto the host and run:

/etc/init.d/hp-ams.sh restart

It looks like this issue is between HP and VMware to sort out; I would probably expect an update to hp-ams soon?  While we are waiting, here is a cheesy video explaining what the service actually does:

http://h30507.www3.hp.com/t5/Coffee-Coaching-HP-and-Microsoft/HP-ProLiant-Gen8-Agentless-Management-Overview/ba-p/108579#.Us6mOfRdW14

C

EDIT 2

Logged this with VMware and they asked me to pass to HP.  Now logged with HP.

Reset a misbehaving blade server ILO from a C7000 Onboard Administrator via PuTTY

A colleague of mine found this one.  We had an HP BL blade server you could not access via ILO for “love nor money”.  A quick reset via PuTTY to the Onboard Administrator (OA) saved the day (and spared us re-seating the server in the enclosure).

C

HP ProLiant BL460c GEN8 blade server and CPU Overheating

I have seen this issue recently on GEN8 blades.  Fortunately the overheating is being incorrectly reported by the ILO, and it looks like ILO code version 1.32 fixes this!

vCenter/ESXi 5.5 and hardware version 10

Be warned if you want to upgrade your VM hardware to version 10 following an upgrade to vCenter 5.5 and ESXi 5.5.  Once the VM hardware has been upgraded to version 10 (don’t forget to update VMware Tools first) you will have to manage the settings of the VM using the vSphere Web Client.  And no, you cannot downgrade the hardware version!

ESXTOP, NUMA, virtual sockets and cores. The penny finally drops.

I have recently been watching the Trainsignal/Pluralsight VCAP training series with Jason Nash.

One of the lessons discusses NUMA and gives recommendations on setting virtual sockets and cores correctly within the vSphere client.  Another lesson gives a good overview of using ESXTOP and points the watcher in the direction of a vSphere 5 ESXTOP quick reference poster.  Both were highly educational videos and are well worth watching.

The quick reference poster (created by Andi Lesslhumer) can be found here:

http://www.vmworld.net/wp-content/uploads/2012/05/Esxtop_Troubleshooting_eng.pdf

A good read on NUMA and how to correctly set sockets/cores on VMs can be found here:

https://blogs.vmware.com/vsphere/2013/10/does-corespersocket-affect-performance.html

I am really glad I decided to begin my VCAP journey and look forward to finishing the training videos (and then watching them all over again!).

EDIT A nice overview doc of ESXTOP can be found here:

http://vsandbox.com/wp-content/uploads/2013/06/ESXTOP.pdf

An overview of maximum sockets per O.S. can be found here:

http://blogs.technet.com/b/matthts/archive/2012/10/14/windows-server-sockets-logical-processors-symmetric-multi-threading.aspx

C

ESXi “Bank 6 not a valid bootbank error” on Proliant Gen8 Server

ESXi is installed on an SD card and, following an update to the BIOS, the server may only see the first 32MB of the card.  This is enough to get the boot going but the boot crashes out pretty quickly.

The answer can be found in this article on the HP site which recommends updating the ILO to firmware revision 1.30.

Note that this version of the firmware seems to break the “Remote Console” link within ILO in IE unless you run in compatibility mode.  Firefox and Chrome appear to be unaffected by this.

Should the rather long link above stop working (come on HP, try harder please) then the document reference appears to be “mmr_kc-0108349”.

The ILO code can also be downloaded from here.

When updating ILO firmware, download the code for “Windows”, not “VMware”, or you end up with an odd self-extracting file format (SCEXE) that only works on Linux.  Extract the file to get the .bin file and then log into the ILO to upgrade the ILO firmware.

The good news is that following the firmware upgrade ESXi will boot up normally and your installation should be intact.

C
