How to deploy a static website on Amazon S3

Written by Guadalupe Vocos & Pedro Bratti | Cloud & DevOps Engineers @ DinoCloud

You can deploy an Amazon S3 bucket as a static website, fronted by a CloudFront distribution for content delivery and Route 53 for cloud Domain Name System (DNS).

Why choose a static website?

Hosting static websites is becoming more and more popular, but what does it mean to be static? It means that your site consists of a set of "pre-built" files (HTML, JavaScript, and CSS) that are served directly on request. Combined with the services AWS offers, this lets you run a serverless, flexible, scalable, high-performing, secure, and low-cost infrastructure.

Before you begin:

As you follow the steps in this example, you will work with the following services:

  • CloudFront: Distribution and Origin Access Identity.
  • Route 53: Hosted Zone and Records.
  • S3: Bucket.

You will need to have these prerequisites before starting the steps:

  • Route 53: Domain Name already registered.
  • Certificate Manager: Certificate requested (Optional in case you want to secure communication through the HTTPS protocol).

Step 1: S3 Bucket with Static Website Hosting

  • Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  • Choose Create Bucket, enter the bucket name (for example, medium.dinocloudconsulting.com), select the us-east-1 Region, and choose Create.
  • Now you can upload your index.html to the newly created bucket (a scripted version of this step follows below).
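
If you prefer to script this step, here is a minimal sketch using the AWS SDK for Python (boto3). It assumes boto3 is installed and credentials are configured; the bucket and file names are the example values from this guide.

# Sketch: create the bucket in us-east-1, enable static website hosting,
# and upload index.html.
import boto3

bucket = "medium.dinocloudconsulting.com"
s3 = boto3.client("s3", region_name="us-east-1")

s3.create_bucket(Bucket=bucket)  # us-east-1 needs no LocationConstraint

s3.put_bucket_website(
    Bucket=bucket,
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "index.html"},
    },
)

s3.upload_file(
    "index.html", bucket, "index.html",
    ExtraArgs={"ContentType": "text/html"},
)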

Step 2: Route 53 Create Hosted Zone

  • Sign in to the AWS Management Console and open the Amazon Route 53 console at https://console.aws.amazon.com/route53/.
  • Choose Create Hosted Zone, enter your domain name, select Public Hosted Zone as the type, and choose Create (see the sketch below).
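
The same step with boto3, as a hedged sketch (the domain is the example used in this guide, and CallerReference only needs to be a unique string per request):

# Sketch: create a public hosted zone for the example domain.
import time
import boto3

route53 = boto3.client("route53")
response = route53.create_hosted_zone(
    Name="dinocloudconsulting.com",
    CallerReference=str(time.time()),  # any unique string
)
print(response["HostedZone"]["Id"])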

Step 3.1: CloudFront Create and Configure Distribution and OAI.

  • Sign in to the AWS Management Console and open the Amazon CloudFront console at https://console.aws.amazon.com/cloudfront/.
  • Choose Create Distribution. In the Origin Settings section, for Origin Domain Name, select your bucket's endpoint (for example, medium.dinocloudconsulting.com.s3.amazonaws.com). Note that an Origin Access Identity (OAI) works with the bucket (REST) endpoint, not with the static website endpoint.
  • Under Bucket Access, select Yes, use an OAI, then Create new OAI, and select Yes, update the bucket policy.
  • (Optional) For SSL Certificate, choose Custom SSL Certificate (example.com) and select the custom certificate that covers your domain.
  • Set Alternate Domain Names (CNAMEs) to the domain you will use for the site (for example, medium.dinocloudconsulting.com).
  • In Default Root Object, enter the name of your index document, for example, index.html.
  • Leave the remaining properties at their defaults and choose Create (a scripted sketch of this step follows the list).
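
If you prefer to script this step, the sketch below uses boto3 to create a distribution in front of the bucket. It is illustrative only: it assumes an OAI and (optionally) an ACM certificate already exist, the OAI ID and certificate ARN are placeholders, and the bucket policy that grants the OAI read access (the console's "Yes, update the bucket policy" option) still has to be added separately.

# Sketch: create a CloudFront distribution restricted through an existing OAI.
# OAI_ID and CERT_ARN are placeholders for resources you have already created.
import time
import boto3

cloudfront = boto3.client("cloudfront")
OAI_ID = "E2EXAMPLEOAI"                                                  # placeholder
CERT_ARN = "arn:aws:acm:us-east-1:123456789012:certificate/example-id"  # placeholder

config = {
    "CallerReference": str(time.time()),
    "Comment": "Static website distribution",
    "Enabled": True,
    "DefaultRootObject": "index.html",
    "Aliases": {"Quantity": 1, "Items": ["medium.dinocloudconsulting.com"]},
    "Origins": {
        "Quantity": 1,
        "Items": [{
            "Id": "s3-origin",
            "DomainName": "medium.dinocloudconsulting.com.s3.amazonaws.com",
            "S3OriginConfig": {
                "OriginAccessIdentity": f"origin-access-identity/cloudfront/{OAI_ID}",
            },
        }],
    },
    "DefaultCacheBehavior": {
        "TargetOriginId": "s3-origin",
        "ViewerProtocolPolicy": "redirect-to-https",
        "ForwardedValues": {"QueryString": False, "Cookies": {"Forward": "none"}},
        "MinTTL": 0,
    },
    # Optional: attach the ACM certificate for the custom domain.
    "ViewerCertificate": {"ACMCertificateArn": CERT_ARN, "SSLSupportMethod": "sni-only"},
}

distribution = cloudfront.create_distribution(DistributionConfig=config)
print(distribution["Distribution"]["DomainName"])  # e.g. dxxxxxxxxxxxx.cloudfront.net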

Step 3.2: CloudFront Configure the Error Pages Properties.

  • Select the distribution you just created and go to the Error pages tab.
  • Set the following properties (shown as an API snippet after this list):
    • HTTP error code: 403: Forbidden
    • Customize error response: Yes
    • Response page path: /index.html
    • HTTP response code: 200: OK
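
In the CloudFront API, these console settings correspond to the CustomErrorResponses element of the distribution configuration. A hedged snippet of just that piece, which could be merged into the config shown in Step 3.1 before calling create_distribution (or applied later with get_distribution_config and update_distribution):

# Sketch: CustomErrorResponses matching the console settings above.
custom_error_responses = {
    "Quantity": 1,
    "Items": [{
        "ErrorCode": 403,                 # HTTP error code: 403 Forbidden
        "ResponsePagePath": "/index.html",
        "ResponseCode": "200",            # HTTP response code: 200 OK
        "ErrorCachingMinTTL": 10,
    }],
}
# config["CustomErrorResponses"] = custom_error_responses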

Step 4: Route 53 create a Record.

  • Sign in to the AWS Management Console and open the Amazon Route 53 console at https://console.aws.amazon.com/route53/.
  • In the hosted zone you created earlier, choose Create Record (see the sketch below).
  • Select record type A, enable Alias, and choose your CloudFront distribution's domain name as the target.
  • Wait a couple of minutes for the DNS to propagate, then open the site in your browser.
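
Scripted, the alias record looks like the sketch below. The hosted zone ID and distribution domain are placeholders; Z2FDTNDATAQYW2 is CloudFront's fixed hosted zone ID for alias targets.

# Sketch: point the domain at the CloudFront distribution with an alias A record.
import boto3

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE12345",  # placeholder: your hosted zone ID
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "medium.dinocloudconsulting.com",
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": "Z2FDTNDATAQYW2",           # CloudFront's alias zone ID
                    "DNSName": "dxxxxxxxxxxxx.cloudfront.net",  # your distribution's domain
                    "EvaluateTargetHealth": False,
                },
            },
        }]
    },
)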

That's it! Your static website is now up and running.

At DinoCloud, we take care of turning a company’s current infrastructure into a modern, scalable, high-performance, and low-cost infrastructure capable of meeting your business objectives. If you want more information, to optimize how your company organizes and analyzes data, or to reduce costs, you can contact us here.

Guadalupe Vocos

Cloud & DevOps Engineer
@DinoCloud

Pedro Bratti

Cloud & DevOps Engineer
@DinoCloud


Social Media:

LinkedIn: https://www.linkedin.com/company/dinocloud
Twitter: https://twitter.com/dinocloud_
Instagram: @dinocloud_
Youtube: https://www.youtube.com/c/DinoCloudConsulting

AWS AppSync + GraphQL

Written by Nicolás Tosolini | Associate Software Engineer @ DinoCloud

What is AppSync, and what is it used for?

AppSync is an AWS service that simplifies application development. If you are building a front end, you usually need to expose an API as quickly as possible; AppSync lets you deploy that API in a couple of clicks, and AWS takes care of the maintenance. It is especially useful for front-end development because it gives you flexible, secure access to data from one or more sources. AppSync can gather data coming from different origins, transform it, and return only the relevant information, an essential feature because it reduces payload size and improves performance.

AWS AppSync. Source: AWS

Another capability AppSync provides is real-time management and synchronization of application data. With subscriptions, for example, clients keep a connection open to the server and are notified whenever the data they care about changes, which is ideal for building chats and other applications that push data in real time and need precise access to that data.

On the other hand, AppSync also lets applications access and modify data while offline. Take a mobile application: if you run out of mobile data or lose Wi-Fi, you would normally expect the app to stop working, but with AppSync it can keep functioning. Once the internet connection is back, AppSync merges the changes made while offline, resolves conflicts, and syncs everything to the database.

What is AWS AppSync? Source: AWS

Another clear example: whether the requests come from enterprise apps, web apps, mobile apps, or IoT devices, AppSync takes the information and processes it through a GraphQL schema and resolvers. The data sources can live anywhere, inside or outside AWS, which is valuable when the information you need is spread across, say, four different databases: with AppSync you can gather everything in a single call.

AppSync uses GraphQL as its query language and can connect different data sources, whether they are inside or outside of AWS. It offers a high degree of security, since it can authenticate requests with Amazon Cognito, IAM, or the other authentication options AWS provides, along with subscriptions for real-time apps and serverless caching.

GraphQL and its functions

GraphQL simplifies data access and queries: the client requests only the data it needs, in the format it wants. Searching, filtering, and querying data are the three dominant operations in this language. The server gathers the information, filters it, and returns just what was asked for, which greatly increases speed, especially for a mobile application that is not on Wi-Fi.

Another GraphQL feature is subscriptions, which provide real-time updates and access. They make it possible to build applications that react instantly, for example a chat where the other user has to see a message as soon as you send it, keeping large amounts of data synchronized at the right time.

AppSync is also relevant for caching, since it can cache at the endpoint and resolver level, increasing response speed.

What are the benefits of using AppSync?

  • It is effortless: you can get it up and running quickly and there is no server maintenance. If the application grows, it scales very quickly on top of the whole AWS infrastructure.
  • It provides both offline and real-time access.
  • Finally, unified access: resolvers, Lambda functions, and all your services and data in one place.

Examples of possible applications to create

In a travel application, where the user may or may not be connected to the Internet, especially while in another country, AppSync lets them keep using the application without errors; when connectivity returns, it synchronizes all the information.

Unified access to data. Source: AWS, https://aws.amazon.com/es/appsync/

A social media app will typically involve Lambda functions and access to several sources of information. A network like Facebook holds enormous amounts of information, and this setup lets you reach all of those data points.

Real-time collaboration. Source: AWS, https://aws.amazon.com/es/appsync/

In the case of chat apps, everything relies on so-called real-time subscriptions, where the user has to get a response instantly. Thanks to subscriptions, requests in these services are much faster and more agile. They can also be connected to Amazon Cognito, which adds a further layer of security.

Real-time chat application. Source: AWS, https://aws.amazon.com/es/appsync/

What is the difference between GraphQL and Rest API?

GraphQL exposes a single endpoint from which you can do everything: you send a query, mutation, or subscription, and the server returns the result. It also lets you fetch only the information you want. For example, if you query a single user by ID and ask for only three specific fields, it returns exactly those fields and nothing else. And because there is a single endpoint, one call brings back everything you need. With a REST API, information stored in different places often requires several calls to different endpoints, and those responses frequently include data you do not need. That does not happen with GraphQL: you configure it so that it returns only the relevant information, in the shape you want, which is very efficient, especially when handling large volumes of data.

GraphQL:

  • A query language for APIs.
  • It provides you with a complete and understandable description of your API data.
  • It offers you the possibility of obtaining only the data you need in a single request.

Some definitions:

  • Schema: defines the entities, how they relate to each other, and which of them are available to each client.
  • Query: reads data through the single entry point (endpoint).
  • Mutation: inserts, deletes, and edits elements.
  • Subscription: keeps a real-time connection with the server so the client is immediately informed about important events.

GraphQL Query Language

  • Queries read data
  • Mutations change data
  • Subscriptions subscribe to real-time data
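
As a concrete, hedged illustration, here is how a client might send a query and a mutation over HTTP to a GraphQL endpoint such as the one AppSync exposes, asking for only the fields it needs. The endpoint URL, API key, and schema fields are hypothetical; subscriptions use a separate real-time connection and are not shown.

# Sketch: POST a query and a mutation to a GraphQL endpoint.
# The endpoint, API key, and fields below are hypothetical.
import requests

ENDPOINT = "https://example1234.appsync-api.us-east-1.amazonaws.com/graphql"
HEADERS = {"x-api-key": "da2-examplekey"}

query = """
query GetUser($id: ID!) {
  getUser(id: $id) {
    id
    name
    email
  }
}
"""  # only the three fields we need, nothing else

mutation = """
mutation CreateUser($name: String!) {
  createUser(name: $name) {
    id
  }
}
"""

resp = requests.post(ENDPOINT, json={"query": query, "variables": {"id": "123"}}, headers=HEADERS)
print(resp.json()["data"]["getUser"])

resp = requests.post(ENDPOINT, json={"query": mutation, "variables": {"name": "Ada"}}, headers=HEADERS)
print(resp.json()["data"]["createUser"])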

Cognito: What is it, and what is it for?

  • Amazon Cognito provides authentication, authorization, and user management for your mobile and web applications. 
  • Users can log in directly with a username and password, or through a third party such as Facebook, Amazon, Google, or Apple (federation).
  • It offers user pools (authentication) and identity pools (authorization); a sign-in sketch follows.
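
A hedged sketch of a username-and-password sign-in against a Cognito user pool with boto3 (the app client ID and credentials are placeholders, and the USER_PASSWORD_AUTH flow must be enabled on the app client):

# Sketch: sign a user in against a Cognito user pool and receive JWT tokens.
import boto3

cognito = boto3.client("cognito-idp", region_name="us-east-1")
resp = cognito.initiate_auth(
    ClientId="example-app-client-id",   # placeholder
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "jane", "PASSWORD": "Sup3r-secret!"},
)
tokens = resp["AuthenticationResult"]   # IdToken, AccessToken, RefreshToken
print(tokens["IdToken"][:20], "...")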

At DinoCloud, we take care of turning a company’s current infrastructure into a modern, scalable, high-performance, and low-cost infrastructure capable of meeting your business objectives. If you want more information, to optimize how your company organizes and analyzes data, or to reduce costs, you can contact us here.

Nicolás Tosolini

Associate Software Engineer
@DinoCloud


Social Media:

LinkedIn: https://www.linkedin.com/company/dinocloud
Twitter: https://twitter.com/dinocloud_
Instagram: @dinocloud_
Youtube: https://www.youtube.com/c/DinoCloudConsulting

Data Lake at AWS

Written by Francisco Semino | Lead Solutions Architect @ DinoCloud

What is a Data Lake?

A company typically has data spread across different silos (on-premise databases), which makes it difficult to gather, consolidate, and analyze that information to make business decisions. A Data Lake centralizes all of that data in one place, where it can be processed to generate the statistics and analysis that precede a business decision. You can create charts, dashboards, and visualizations that show how the company, its products, and its customers are doing, among many other things, and you can apply Machine Learning to predict trends and make decisions based on them.

A Data Lake is a repository into which you can ingest structured data (for example, from databases) and unstructured data (from Twitter, logs, etc.), as well as images and video (real-time or recorded). One of its properties is that it can scale to exabytes, a considerable amount of information. That does not mean you need a lot of data to justify a Data Lake; there are no minimums or maximums.

It serves both small and large companies, largely because of its low cost: you pay only for what you use. As a cloud service, there is no need to pay for storage “just in case”; you pay as you go, according to use. Whether the Data Lake grows 5 GB per month or 5 TB per month, you pay only for that usage.

A little history

The Data Warehouse is the company's traditional Business Intelligence system, and one of its properties is that it only accepts structured data. It requires a significant investment because you pay for capacity (the Data Warehouse has its own processing), so historically it was only used by large companies that could afford it.

Because of its high cost, and because its clusters handle processing as well as storage, a Data Warehouse has far less capacity than a Data Lake and cannot scale to exabytes.

The most significant difference, though, is that in a Data Warehouse the user defines the schema before loading data: you must know and define what is going to be stored before you load it, and the data is then analyzed by a Business Intelligence tool that produces dashboards, visualizations, and so on.

This does not mean the Data Lake will supplant the Data Warehouse; rather, it complements it in cases where the company or architecture needs one, or already owns one and does not want to get rid of it.

Data Warehouse process for further analysis
Data Warehouse process for further analysis.

So then, there are three possible architectures:

  1. The company already has a Data Warehouse and wants a Data Lake. The two can work in a complementary way: create the Data Lake separately, send the Data Warehouse's data to it, and use the Data Lake's tools for Big Data processing, Machine Learning, and other workloads that the Data Warehouse alone could not handle.
  2. The company has no Data Warehouse but needs one, along with a Data Lake, because its Business Intelligence tool only supports connections to a Data Warehouse, where the data is structured. The recommendation is to stand up the Data Lake, create a separate Data Warehouse, and do all data ingestion through the Data Lake, sending the information to the Data Warehouse already transformed so the Business Intelligence tool can consume it directly from there. At the same time, all the data remains available for Big Data processing and the other tools the Data Lake provides.
  3. Finally, and simplest: only a Data Lake is required. No Data Warehouse is needed because the Business Intelligence tool supports connecting directly to the Data Lake, so you can simply stand up the Data Lake and do all the Business Intelligence and Big Data processing from there.

Data Lake Properties

The most important property is that it does not matter where the information currently lives: it can be brought in easily, securely (it travels encrypted), and at low cost. Everything can be migrated to a Data Lake: from on-premise systems, from other clouds, from AWS, etc.

In addition, other kinds of data movement are possible. If the application works in real time, for example if you need to send your application's logs, or tweets from Twitter to see what customers think of a product or service, this can be done in real time thanks to a range of AWS services.

What is a Data Lake? Source: AWS

Another possibility is a company that streams video in real time and wants its application to keep working normally while the video is streamed and stored in the Data Lake, where it can also be analyzed in real time.

Once the data is ingested, the important part begins: analyzing it, taking advantage of the Data Lake, and making business decisions that improve the company and its products. From here there are two main branches. The first is analytics on the data: showing it on dashboards, transforming it, building visualizations, and extracting information.

The second branch is Machine Learning, used to make predictions from the information. AWS offers services aimed at companies that already have Machine Learning experts, and other services that let small or medium-sized companies avoid hiring one. For example, Amazon Comprehend understands natural language and turns it into insights: what specific tweets are saying, whether they rate a product positively, negatively, or neutrally, and so on. Amazon Rekognition recognizes faces or objects in, for example, a live stream. This is a great advantage today because it lets small and medium-sized companies build and exploit a Data Lake without a significant investment.

At DinoCloud we are often asked: “How long until my Data Lake is up and running?” The answer is usually no more than two weeks, following the recommended approach: start with the essential functions, exploit the data a little, see what the company needs, and then build dashboards, visualizations, and Machine Learning.

Another common question is: “Would building a Data Lake affect my application or service running in the cloud?” The answer is simply no. The two efforts are entirely complementary and run in parallel: an application can keep being developed while the Data Lake is built, without disturbing it or degrading its performance. That is because requests are not made directly to the database the application uses; instead, AWS services extract the information from a database backup or a read replica, for example, without affecting the application and at low cost.

AWS SERVICES

S3

Where do I keep the data, where do I store it, what actually is my Data Lake? The answer is Amazon Simple Storage Service (S3), Amazon's object storage. It is virtually unlimited, meaning you can load as many exabytes as you need, and it offers 99.99% availability, so the data stays safe and backed up through any disaster or incident that may occur. As Amazon's first cloud service it is quite polished and powerful, and practically every AWS service integrates with S3, which is the most important reason to choose it as the storage for a Data Lake. It also scales automatically and charges only for what is used, nothing more.

Another of its main characteristics is security: you can block access for other users so that data is only reachable through the authorized AWS services, and you can encrypt the information with KMS (AWS Key Management Service). You can also control properties at the object level itself, for example making a single file public within a bucket without having to make the entire bucket public.
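
A hedged sketch of these controls with boto3: blocking public access on the Data Lake bucket, enabling default KMS encryption, and (optionally) adjusting a single object's ACL. The bucket name and key alias are placeholders.

# Sketch: lock down and encrypt a Data Lake bucket.
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"  # placeholder

# Block every form of public access at the bucket level.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Encrypt every new object by default with a KMS key.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/my-data-lake-key",  # placeholder
            }
        }]
    },
)

# Object-level control: a single object's ACL can be changed individually
# (only meaningful if public ACLs are not blocked for that bucket).
# s3.put_object_acl(Bucket=bucket, Key="public/report.html", ACL="public-read")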

S3 specific properties.

One of S3's essential advantages is the number of services that can ingest data into it as needed. In other words, it lets you unify all the dispersed data (in a cloud, on-premise, etc.) in a single Data Lake.

In terms of cost, S3 charges only for what is used and no more, and the price is tied to how frequently the data is accessed. S3 Standard has an estimated price of $0.021 per GB per month.

Next to S3 Standard is S3 Standard-IA (Infrequent Access). For less frequently accessed data, its price is almost 40% lower while keeping the same properties as S3 Standard: it is stored across three Availability Zones, is available all the time, and offers millisecond access. However, Amazon charges a small retrieval fee per gigabyte extracted, so each time you access the data there is a small charge for the objects requested.

Worth mentioning as well is S3 One Zone-IA, which is like S3 Standard-IA except that the data lives in a single Availability Zone; it still offers high availability and is generally used for backups. There are also S3 Glacier, where retrieving data takes minutes to hours, and S3 Glacier Deep Archive, where retrieval takes 12 to 48 hours. These classes are used for data accessed once or twice a year, and their cost is extremely low.
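
These storage classes are usually combined with a lifecycle rule that tiers data down over time. A hedged sketch, with the bucket name and prefix as placeholders, that moves objects to Standard-IA after 30 days and to Glacier after 180:

# Sketch: lifecycle rule transitioning data to cheaper storage classes over time.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
        }]
    },
)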

How is data ingested into a Data Lake? Here are some AWS services that can be used to bring data in:

  • AWS Direct Connect: a dedicated, private network connection that sends data securely without crossing the public internet. Recommended for large amounts of data.
  • Amazon Kinesis: for streaming data and video.
  • AWS Storage Gateway: a virtual bridge between AWS and an on-premise environment that allows files to be transferred safely.
  • AWS Snowball: commonly used for physical migrations of data at the terabyte to petabyte scale.
  • AWS Transfer for SFTP: provisions managed SFTP servers and can be used over a VPN.

Kinesis

It is a real-time service from Amazon. It is divided into four sub-services:

  • Amazon Kinesis Video Streams: streams live video; while the streaming pipeline is running, the data can be ingested into S3 in real time or analyzed on the fly.
  • Amazon Kinesis Data Firehose: ingests data in near real time into S3, Redshift, and other destinations. If an application is constantly emitting events or logs, Firehose delivers them continuously to S3, Elasticsearch, or Redshift.
  • Amazon Kinesis Data Streams: real-time data streaming, usually used to send data to applications, directly to an EC2 instance for processing, or straight into Amazon Kinesis Data Analytics.
  • Amazon Kinesis Data Analytics: real-time analytics that lets you query the data as it flows through.
The four Kinesis sub-services. Source: AWS.

An essential property of Kinesis is that it is Serverless; you pay only for what you use.
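
For example, an application that constantly emits events can push each record into a Firehose delivery stream that lands in S3. A hedged sketch; the delivery stream is a placeholder and must already exist with an S3 destination:

# Sketch: send application events to a Kinesis Data Firehose delivery stream
# that delivers them to S3 in near real time.
import json
import boto3

firehose = boto3.client("firehose")

event = {"user_id": 42, "action": "page_view", "ts": "2021-06-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="my-clickstream-to-s3",  # placeholder
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)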

AWS Glue

How do you consume data from a Data Lake? The answer starts with AWS Glue, an Amazon service with two main parts. The first is the Data Catalog, where all the data is cataloged and its metadata is extracted and stored; it keeps the Data Lake organized so other services can consume it later, and having a data catalog is crucial. Glue also has a component called a Crawler, which extracts the metadata of all the data automatically and serverlessly: you create a Crawler, it extracts the metadata, and you are charged only for the minutes it ran. The data store can be S3 or any other supported storage. The catalog is saved in Glue's Data Catalog as a database whose tables register all the necessary information. The formats supported by crawlers are CSV, Avro, Ion, GrokLog, JSON, XML, Parquet, and Glue Parquet.

Queries in an Amazon S3 Data Lake.
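
A hedged sketch of creating and running such a Crawler over an S3 prefix with boto3; the role, database, and path are placeholders, and the IAM role needs Glue and S3 permissions:

# Sketch: catalog an S3 prefix with a Glue Crawler.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="data-lake-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="data_lake_catalog",                       # placeholder
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/"}]},
)
glue.start_crawler(Name="data-lake-raw-crawler")
# When it finishes, the tables (schema plus metadata) appear in the Data Catalog.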

The second part is ETL, a central concept in the Data Lake and Big Data world: data is extracted from a data source, transformed by a script running in an engine, and then loaded, already transformed, into a target. The data source and the data target do not have to be different; they can be the same.

Supported data sources and targets are Amazon S3, RDS, Redshift, and JDBC connections.

AWS Glue Jobs lets you run such a script serverlessly. You can attach a trigger, for example so that every time a new file lands in S3 the job runs automatically. However, the data must be cataloged before a job can use it, since tables can only be created from cataloged data: if you move data from S3 to Redshift, the metadata must exist to create the Redshift tables; otherwise it has to be done manually. The job procedure is as follows:

  • the job is triggered in some way (on demand or by a specific trigger),
  • the data is extracted from the source,
  • a script transforms the data, and
  • the transformed data is loaded into the target.

It is essential to know that you do not need to be able to program in Python to run the script: AWS lets you specify the transformations you want and writes the script automatically. If a change is required, the generated script is available for editing, which is one of the main advantages of AWS Glue Jobs.
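
For reference, the generated script usually follows the shape sketched below. It only runs inside the Glue job environment, and the catalog database, table, mappings, and target path are placeholders.

# Sketch of a Glue ETL job script: extract from the catalog, transform, load to S3.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read the cataloged source table.
source = glueContext.create_dynamic_frame.from_catalog(
    database="data_lake_catalog", table_name="raw_orders"
)

# Transform: keep and retype only the columns we care about.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")],
)

# Load: write the transformed data back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-bucket/transformed/orders/"},
    format="parquet",
)
job.commit()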

AWS Athena

Another way to consume data from a Data Lake is Amazon Athena, a serverless service that lets you run SQL queries directly against S3. Queries process data at high speed, and setup is fast: just open the Athena console, indicate which data to analyze, and start writing. The data does need to be cataloged, either by a crawler or by hand. You pay only for the data scanned; if a query scans 1 GB, you are charged for exactly 1 GB.

Athena can be used from anywhere: for example, a Business Intelligence tool that needs data from S3 can connect to Athena and run the query against S3 itself. The Business Intelligence tool where the dashboards are displayed thus gains the connection and the processing capacity to pull the data without having to move it all into a Data Warehouse.
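
Programmatically, an Athena query follows a start / poll / fetch pattern. A hedged sketch; the database, table, and results bucket are placeholders:

# Sketch: run a SQL query against cataloged S3 data with Athena and read the results.
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT order_id, amount FROM raw_orders LIMIT 10",
    QueryExecutionContext={"Database": "data_lake_catalog"},          # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes; you pay only for the data it scans.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])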

AWS Elastic Map Reduce

Finally, we will talk about Amazon EMR (Elastic MapReduce), Amazon's Big Data service par excellence. It deploys applications for the main open-source frameworks, such as Apache Spark, Hadoop, Presto, and Hive, and lets you configure everything in cluster mode. It is self-scaling and highly available, which matters because there are situations where a large amount of data must be processed at a particular time; you are charged only for the time used, which saves a lot of money. It is multi-Availability Zone with data redundancy, so whatever happens, everything stays up and available to the user. It is easy to administer and configure: from the console you choose the frameworks you want, the number and type of nodes, and so on, and EMR sets it up automatically. Amazon EMR is tightly integrated with the Data Lake and all of the services listed above.
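
A hedged sketch of launching such a transient cluster, running one Spark step, and letting it terminate on completion. Instance types, the script location, the log bucket, and the (default) EMR roles are placeholders or assumed to exist already.

# Sketch: launch a transient EMR cluster with Spark, run one step, then terminate.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="data-lake-batch-processing",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-emr-logs/",                        # placeholder
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,          # shut down when the step ends
    },
    JobFlowRole="EMR_EC2_DefaultRole",                 # assumed default roles
    ServiceRole="EMR_DefaultRole",
    Steps=[{
        "Name": "process-data-lake",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-data-lake-bucket/scripts/process.py"],
        },
    }],
)
print(response["JobFlowId"])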

After ingesting and processing all the data comes the part business people care most about. AWS's Business Intelligence service is Amazon QuickSight, the first BI service priced per session: you pay only each time you enter the QuickSight console, not per user or per license. As in all Business Intelligence there are two roles: the author, who prepares and exploits the data, and the reader, who views the data to make decisions.

At DinoCloud, we take care of turning a company’s current infrastructure into a modern, scalable, high-performance, and low-cost infrastructure capable of meeting your business objectives. If you want more information, to optimize how your company organizes and analyzes data, or to reduce costs, you can contact us here.

Francisco Semino

Lead Solutions Architect
@DinoCloud


Social Media:

LinkedIn: https://www.linkedin.com/company/dinocloud
Twitter: https://twitter.com/dinocloud_
Instagram: @dinocloud_
Youtube: https://www.youtube.com/c/DinoCloudConsulting

Analyzing Data

Among other things, data makes it possible to know what needs to improve based on previous events.

Written by William Díaz Tafur

Data analysis is vital for companies because it provides the answers the business needs in order to innovate in any area.

Moreover, decisions made from data have a very high rate of effectiveness. They make it possible to know what needs to improve based on previous events, since making a decision blindly or by instinct is not the same as making one based on data from past operations.

Carrying out operations

On the other hand, data can be used in an application that carries out operations automatically and, based on previous situations, makes the decision itself; or it can be used at the visualization step, where a person looks at the data and makes decisions from it.

Similarly, the hypotheses or theories that companies raise about their business are validated against the results of the more or less intelligent analysis of the data they already possessed or are beginning to process thanks to data engineering.

Uses and tools

The most common uses are log analysis, e-commerce personalization or recommendation engines, fraud detection and financial reports, among many others.


As for tools for data analysis, the choice depends on the type of analysis needed. The best known are the Apache big data frameworks, which can also be run on AWS through the EMR service.

Machine Learning

Data analysis also includes what are known as machine learning techniques, which allow a “machine” to learn from past data in order to analyze current information.


For example, an e-commerce company can train a machine learning model so that, given a transaction, it says whether it is fraudulent or not.


The model is trained on the business's historical transaction data; the more past data it has, the more effective it is, and it keeps learning the more it is used.
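
A minimal, hedged sketch of that idea, assuming scikit-learn is available and a hypothetical transactions.csv with numeric feature columns and an is_fraud label (the column names are illustrative):

# Sketch: train a fraud classifier on historical transactions and score a new one.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("transactions.csv")        # hypothetical historical data
X = df[["amount", "hour", "items"]]         # illustrative feature columns
y = df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on held-out transactions, then score a new one.
print(classification_report(y_test, model.predict(X_test)))
new_transaction = [[120.0, 23, 1]]          # amount, hour, items
print("fraud" if model.predict(new_transaction)[0] else "legitimate")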


Social Media:

LinkedIn: https://www.linkedin.com/company/dinocloud
Twitter: https://twitter.com/dinocloud_
Instagram: @dinocloud_
Youtube: https://www.youtube.com/c/DinoCloudConsulting