Discussion:
cloud-init failing to run user-data, about 1 in 10 times
Jason Price
2018-12-04 03:00:39 UTC
Permalink
Cloud provider AWS. Debian Stretch images (9.5). Cloud-init v0.7.9-2

Built AMI in an ASG, rarely will fail to run the user-data script.
Succeeds about 9 in 10 times, fails the other 1.

cloud-init-output.log ending looks like this when it fails:

Generating public/private ed25519 key pair.
Your identification has been saved in /etc/ssh/ssh_host_ed25519_key.
Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub.
The key fingerprint is:
SHA256:y1DX7UpYgc+hmjyuHgq58H0oHk+ivJctTTCShX2Z+rw ***@ip-172-20-137-71
The key's randomart image is:
+--[ED25519 256]--+
| o o .. |
| . o + ...o |
| o o . .+o.. |
| o + . ..oo. |
| . = ..So. . . |
| . + o=. . . |
|. = *.o.o. . |
|.+.%.E... |
|.+*.=oo. |
+----[SHA256]-----+
Cloud-init v. 0.7.9 running 'modules:config' at Mon, 03 Dec 2018
23:37:42 +0000. Up 9.99 seconds.
2018-12-03 23:37:42,924 - ug_util.py[WARNING]: Format for 'users' key
must be a comma separated string or a dictionary or a list and not
NoneType
Generating locales (this might take a while)...
en_US.UTF-8... done
Generation complete.

The /var/lib/cloud/boot-finished file does NOT get created when it fails.

If I'm parsing /var/lib/cloud/instance/sem/* correctly, it tries the
config_byobu module (?) before it goes off the rails.

cloud-init does download the correct user-data script to
/var/lib/cloud/instance/user-data.txt .

The first two lines of the user-data script are:

#!/bin/bash
echo "I am here" | /usr/bin/tee -a /tmp/uds-logging

that output doesn't show up in cloud-init-output, nor in /tmp/uds-logging

I have captured the cloud-init.log from a failure run. I'm happy to
share it if requested. Interesting difference between a success:

main.py[DEBUG]: Ran 17 modules with 0 failures

and a failure:

main.py[DEBUG]: Ran 12 modules with 0 failures

Thank you for any assistance, and please let me know how I can help.
Jason
Thomas Goirand
2018-12-04 08:06:38 UTC
Permalink
Post by Jason Price
Thank you for any assistance, and please let me know how I can help.
Jason
Hi,

Could you try with the latest cloud-init from Sid?

Cheers,

Thomas Goirand (zigo)
Jason Price
2018-12-04 16:12:07 UTC
Permalink
Post by Thomas Goirand
Post by Jason Price
Thank you for any assistance, and please let me know how I can help.
Could you try with the latest cloud-init from Sid?
Alright, I've begun testing with the new build. I'll respond back if
that helps, but it'll take some time to gain the confidence
Jason Price
2018-12-04 20:48:03 UTC
Permalink
Findings: with an ASG of 20 nodes, one of them failed to run the
user-data when running cloud-init v18.3-3

Entire user-data script for reference:

#!/bin/bash
echo "I was here" | tee -a /tmp/uds-logging

I was also able to run `cloud-init analyze show` on the failing (and
succeeding) nodes. The failing nodes seem to skip entirely the
"modules-final" stage

Successful run:
-----------------------------------------------------------------------------------------------------------------------------------

$ sudo cloud-init analyze show
-- Boot Record 01 --
The total time elapsed since completing an event is printed after the
"@" character.
The time the event takes is printed after the "+" character.

Starting stage: single
Starting stage: init-local
|`->cache invalid in datasource: DataSourceEc2 @18647.94500s +05.06200s
|`->no local data found from DataSourceEc2Local @18653.00900s +00.03200s
Finished stage: (init-local) 05.14900 seconds

Starting stage: init-network
|`->no cache found @18653.56600s +00.00000s
|`->found network data from DataSourceEc2 @18653.57000s +00.12600s
|`->setting up datasource @18653.73800s +00.00000s
|`->reading and applying user-data @18653.74700s +00.00300s
|`->reading and applying vendor-data @18653.75000s +00.00000s
|`->activating datasource @18653.76600s +00.00100s
|`->config-migrator ran successfully @18653.82700s +00.00100s
|`->config-seed_random ran successfully @18653.82800s +00.00100s
|`->config-growpart ran successfully @18653.82900s +00.04800s
|`->config-bootcmd ran successfully @18653.87800s +00.00000s
|`->config-write-files ran successfully @18653.87800s +00.00100s
|`->config-growpart ran successfully @18653.87900s +00.02400s
|`->config-resizefs ran successfully @18653.90300s +00.02000s
|`->config-disk_setup ran successfully @18653.92400s +00.00100s
|`->config-mounts ran successfully @18653.92500s +00.00500s
|`->config-set_hostname ran successfully @18653.93000s +00.00100s
|`->config-update_hostname ran successfully @18653.93100s +00.00100s
|`->config-update_etc_hosts ran successfully @18653.93200s +00.00400s
|`->config-ca-certs ran successfully @18653.93600s +00.00100s
|`->config-rsyslog ran successfully @18653.93700s +00.00100s
|`->config-users-groups ran successfully @18653.93800s +00.00100s
|`->config-ssh ran successfully @18653.94300s +00.22000s
Finished stage: (init-network) 00.60700 seconds

Starting stage: modules-config
|`->config-emit_upstart ran successfully @18655.03900s +00.00100s
|`->config-ssh-import-id ran successfully @18655.04100s +00.00100s
|`->config-locale ran successfully @18655.04200s +00.00100s
|`->config-set-passwords ran successfully @18655.04400s +00.00100s
|`->config-grub-dpkg ran successfully @18655.04500s +00.23700s
|`->config-apt-pipelining ran successfully @18655.28300s +00.00300s
|`->config-apt-configure ran successfully @18655.28600s +00.09200s
|`->config-ntp ran successfully @18655.37900s +00.00200s
|`->config-timezone ran successfully @18655.38100s +00.00200s
|`->config-disable-ec2-metadata ran successfully @18655.38300s +00.00100s
|`->config-runcmd ran successfully @18655.38400s +00.00100s
|`->config-byobu ran successfully @18655.38600s +00.00100s
Finished stage: (modules-config) 00.39000 seconds

Starting stage: modules-final
|`->config-package-update-upgrade-install ran successfully
@18658.53600s +00.00200s
|`->config-fan ran successfully @18658.53800s +00.00100s
|`->config-puppet ran successfully @18658.54000s +00.00100s
|`->config-chef ran successfully @18658.54100s +00.00100s
|`->config-salt-minion ran successfully @18658.54200s +00.00100s
|`->config-mcollective ran successfully @18658.54300s +00.00100s
|`->config-rightscale_userdata ran successfully @18658.54500s +00.00400s
|`->config-scripts-vendor ran successfully @18658.55000s +00.00200s
|`->config-scripts-per-once previously ran @18658.55300s +00.00000s
|`->config-scripts-per-boot ran successfully @18658.55300s +00.00100s
|`->config-scripts-per-instance ran successfully @18658.55400s +00.00100s
|`->config-scripts-user ran successfully @18658.55600s +00.00800s
|`->config-ssh-authkey-fingerprints ran successfully @18658.56500s +00.00900s
|`->config-keys-to-console ran successfully @18658.57500s +00.08700s
|`->config-phone-home ran successfully @18658.66200s +00.00200s
|`->config-final-message ran successfully @18658.66500s +00.00700s
|`->config-power-state-change ran successfully @18658.67300s +00.00100s
Finished stage: (modules-final) 00.24300 seconds

Total Time: 6.38900 seconds

1 boot records analyzed

-----------------------------------------------------------------------------------------------------------------------------------
Failed run:
-----------------------------------------------------------------------------------------------------------------------------------
$ sudo cloud-init analyze show
-- Boot Record 01 --
The total time elapsed since completing an event is printed after the
"@" character.
The time the event takes is printed after the "+" character.

Starting stage: single
Starting stage: init-local
|`->cache invalid in datasource: DataSourceEc2 @18619.79200s +05.04800s
|`->no local data found from DataSourceEc2Local @18624.84200s +00.01800s
Finished stage: (init-local) 05.09500 seconds

Starting stage: init-network
|`->no cache found @18625.41900s +00.00000s
|`->found network data from DataSourceEc2 @18625.42300s +00.11500s
|`->setting up datasource @18625.58000s +00.00000s
|`->reading and applying user-data @18625.59000s +00.00300s
|`->reading and applying vendor-data @18625.59300s +00.00000s
|`->activating datasource @18625.60900s +00.00100s
|`->config-migrator ran successfully @18625.65900s +00.00100s
|`->config-seed_random ran successfully @18625.66000s +00.00100s
|`->config-growpart ran successfully @18625.66100s +00.03700s
|`->config-bootcmd ran successfully @18625.69800s +00.00000s
|`->config-write-files ran successfully @18625.69900s +00.00000s
|`->config-growpart ran successfully @18625.70000s +00.02400s
|`->config-resizefs ran successfully @18625.72400s +00.02000s
|`->config-disk_setup ran successfully @18625.74500s +00.00100s
|`->config-mounts ran successfully @18625.74700s +00.00200s
|`->config-set_hostname ran successfully @18625.75000s +00.00100s
|`->config-update_hostname ran successfully @18625.75100s +00.00100s
|`->config-update_etc_hosts ran successfully @18625.75300s +00.00400s
|`->config-ca-certs ran successfully @18625.75700s +00.00100s
|`->config-rsyslog ran successfully @18625.75800s +00.00000s
|`->config-users-groups ran successfully @18625.75900s +00.00100s
|`->config-ssh ran successfully @18625.76000s +00.20700s
Finished stage: (init-network) 00.55900 seconds

Starting stage: modules-config
|`->config-emit_upstart ran successfully @18626.96000s +00.00000s
|`->config-ssh-import-id ran successfully @18626.96100s +00.00200s
|`->config-locale ran successfully @18626.96300s +00.00200s
|`->config-set-passwords ran successfully @18626.96500s +00.00100s
|`->config-grub-dpkg ran successfully @18626.96600s +00.24200s
|`->config-apt-pipelining ran successfully @18627.20800s +00.00200s
|`->config-apt-configure ran successfully @18627.21200s +00.08000s
|`->config-ntp ran successfully @18627.29200s +00.00100s
|`->config-timezone ran successfully @18627.29400s +00.00100s
|`->config-disable-ec2-metadata ran successfully @18627.29500s +00.00100s
|`->config-runcmd ran successfully @18627.29600s +00.00100s
|`->config-byobu ran successfully @18627.29700s +00.00100s
Finished stage: (modules-config) 00.43600 seconds

Total Time: 6.09000 seconds

1 boot records analyzed
-----------------------------------------------------------------------------------------------------------------------------------

How can I triage this further?

Thank you for your time and assistance;
Jason
Post by Jason Price
Post by Thomas Goirand
Post by Jason Price
Thank you for any assistance, and please let me know how I can help.
Could you try with the latest cloud-init from Sid?
Alright, I've begun testing with the new build. I'll respond back if
that helps, but it'll take some time to gain the confidence
Joshua Powers
2018-12-06 15:42:29 UTC
Permalink
Post by Jason Price
How can I triage this further?
Logs for cloud-init are stored in:

/var/log/cloud-init.log
/var/log/cloud-init-output.log

View those on the failing instance, look for any trace or other error
messages. There is also the 'cloud-init collect-logs' subcommand to
collect and tar up data to send others help triage failures.

If you find it is related to cloud-init, you can ask questions at
#cloud-init on freenode.
Jason Price
2018-12-06 21:37:10 UTC
Permalink
In one failing run, I see three tracebacks:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/ec2_utils.py", line
178, in _get_instance_metadata
response = caller(md_url)
File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line
107, in read_file_or_url
exception_cb=exception_cb)
File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line
301, in readurl
raise excps[-1]
cloudinit.url_helper.UrlError:
HTTPConnectionPool(host='169.254.169.254', port=80): Max retries
exceeded with url: /2009-04-04/dynamic/instance-identity (Caused by
NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection
object at 0x7f1f7d7b70b8>: Failed to establish a new connection:
[Errno 101] Network is unreachable',))


2018-12-05 12:53:25,219 - util.py[DEBUG]: Getting data from <class
'cloudinit.sources.DataSourceEc2.DataSourceEc2Local'> failed
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1992, in subp
env=env, shell=shell)
File "/usr/lib/python3.5/subprocess.py", line 676, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.5/subprocess.py", line 1282, in _execute_child
raise child_exception_type(errno_num, err_msg)
PermissionError: [Errno 13] Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py",
line 523, in find_source
if s.get_data():
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py",
line 383, in get_data
return super(DataSourceEc2Local, self).get_data()
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py",
line 142, in get_data
return_value = self._get_data()
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceEc2.py",
line 105, in _get_data
with EphemeralDHCPv4(self.fallback_interface):
File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line
46, in __enter__
leases = maybe_perform_dhcp_discovery(self.iface)
File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line
99, in maybe_perform_dhcp_discovery
return dhcp_discovery(dhclient_path, nic, tdir)
File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line
163, in dhcp_discovery
util.subp(cmd, capture=True)
File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2000, in subp
stderr="-" if decode else b"-")
cloudinit.util.ProcessExecutionError: Unexpected error while running command.
Command: ['/var/tmp/cloud-init/cloud-init-dhcp-cxj1x1bs/dhclient',
'-1', '-v', '-lf',
'/var/tmp/cloud-init/cloud-init-dhcp-cxj1x1bs/dhcp.leases', '-pf',
'/var/tmp/cloud-init/cloud-init-dhcp-cxj1x1bs/dhclient.pid', 'eth0',
'-sf', '/bin/true']
Exit code: -
Reason: [Errno 13] Permission denied
Stdout: -
Stderr: -
2018-12-05 12:53:25,249 - main.py[DEBUG]: No local datasource found
Post by Joshua Powers
Post by Jason Price
How can I triage this further?
/var/log/cloud-init.log
/var/log/cloud-init-output.log
View those on the failing instance, look for any trace or other error
messages. There is also the 'cloud-init collect-logs' subcommand to
collect and tar up data to send others help triage failures.
If you find it is related to cloud-init, you can ask questions at
#cloud-init on freenode.
Loading...