Luciano Moreira

Home page: http://luticm.blogspot.com/

Posts by Luciano Moreira

Book review – Joe Celko's Complete Guide to NoSQL

I've finished reading "Joe Celko's Complete Guide to NoSQL" and here are my impressions…

Joe Celko is well known in the relational world and I've wanted to read one of his books for a long time, so I thought: who better than someone who knows relational databases well to talk about NoSQL? The view would be more critical and realistic, not some daydream claiming that relational databases will be dead by 2020 or whenever.

Unfortunately the book falls short, especially if your expectations are very high, as mine were. The author loses his way in some chapters, with content that doesn't add to the subject at hand and examples and code snippets that are beside the point, while leaving other chapters quite superficial.

And if you want a book with small examples of MongoDB, Hadoop, Cassandra, Neo4J, etc. in use, so you can draw parallels, this is NOT your book. In general the author only mentions the products, without showing usage examples (e.g., programming or administration) or outlining the pros and cons of each one, which makes it clear that the book's purpose is to cover the concepts behind the different types of databases.

Strengths:

  • The author covers other database models beyond the four traditional NoSQL "categories" (key-value store, column store, document store and graph).
  • With his vast experience, Celko knows much of the history of many databases, which helps us understand how the technology evolved.
  • Many of the chapter references are fantastic, providing a good guide for digging deeper into the subject.
  • For the topics the author seems most familiar with, he illustrates well how the problem is solved in the relational world, comparing it with the other approach.
    • Sensitive topics such as concurrency, isolation levels and their effects are covered in excellent detail.

Weaknesses:

  • It's common to stop in the middle of a chapter and ask yourself: why is this here? It's out of context and could have been just a reference for further reading.
  • It's also clear which chapters the author has more experience with and spends more time on, which makes the book uneven.
    • E.g.: the "Hierarchical and network databases" chapter is more detailed and longer than the map-reduce one.
  • It isn't a book that shows substantial NoSQL code examples; instead it shows snippets of how the problem would be solved with traditional SQL.
    • A small curiosity: the word JSON appears once in the book. The word MongoDB appears in only four different paragraphs.
  • You probably expect more detail on columnar databases or map-reduce than a chapter on "Biometrics, Fingerprints, and Specialized Databases".

I was torn between giving it two or three stars, so I'll settle on 2.5 stars (out of 5!). Some chapters are very good, while others aren't worth it, since they are shallow even as an overview.

Here is the link to the book on Amazon: http://www.amazon.com/Celkos-Complete-Guide-NoSQL-Non-Relational/dp/0124071929

I confess I have other Joe Celko books on my shelf (or on my Kindle) that I haven't read yet and, even though this one disappointed me, I will certainly invest the time to read the others, since I've had excellent recommendations for them.

Cheers

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

Scheduled: My first Oracle training

In June 2015 I'm taking my first Oracle training and of course, being a database geek, I hardly need to say how excited I am about the idea. So I decided to write a bit about it...

I've wanted to get to know Oracle for a long time, since today I know absolutely nothing about the product, which is almost a cardinal sin for a professional who studies every day to become a better data architect.

I firmly believe that knowing other databases, and understanding their strengths, limitations and architectural details, keeps you from becoming a narrow-minded zealot and lets you propose elegant solutions to complex problems... Besides being a lot of fun!
For example: I'm running a benchmark of DB2's columnar storage and discussing the use of SQL Server ColumnStore at a company with the Nimbus consultants. Comparing the different implementations, we already have a good idea of the potential difficulties and how to overcome them.

And if I keep adding knowledge of Oracle, NoSQL, and other data storage and processing models, the problems I'll get the chance to tackle in the future will certainly be even more challenging.

Remember, you're hearing this from a professional who has had SQL Server around for more than 15 years and who, even after spending the last two years dedicating more time to DB2, considers himself a better SQL Server consultant and instructor today.

So, will I take Oracle's official training courses? No way! At the moment that's pure prejudice on my part, built up over years of knowing different SQL Server training offerings, so I'm going to put all my chips on Nerv. I see a lot of similarity between the training models of Nimbus and Nerv, the company of Oracle ACE Ricardo Portilho (http://nervinformatica.com.br/instrutores.php), so the choice was a logical one.

I'm going to start right away with an advanced course, Oracle Performance Diagnostic and Tuning (http://nervinformatica.com.br/opdt.php), since I'll be able to draw many parallels with other DBMSs and potentially ask interesting questions that would be completely out of place in a beginner class.

After the course I'll post my impressions here! And I'll try to keep my curiosity under control, otherwise I'm sure Portilho will kick me out of the room and I'll never take another training there again... :-)

Cheers

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

SQL Server Além do Conceito – Blog Post Collection

A collection of posts from Brazilian SQL Server blogs has been released, aiming to compile a few articles chosen by each author.
The idea came from a group that calls itself "SQL Friends" and whose goal is to help people interested in SQL Server by gathering some very interesting content in a single place.

And obviously the book is free... go download it!

** Nogare kindly provided space on his site to publish the book, which can also be accessed directly from Dropbox: https://www.dropbox.com/s/15bk2vh2cjdrpu5/SQL%20Server%20Al%C3%A9m%20do%20Conceito%20-%20Blog%20Post%20Collection%20-%20Original.pdf?dl=0



I hope you enjoy it. Happy reading!

Cheers

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

[Fun] Exam objectives… The longest list in the history of computing.

I stumbled upon this just now and couldn't help laughing out loud...

Exam C2030-136: Foundations of IBM Big Data & Analytics Architecture V1 (http://www-03.ibm.com/certify/tests/objC2030-136.shtml)

If you go to the "Objectives" tab, you'll find a "small" list, which I'm pasting below for the record... I think it's the complete study material. LOL

I might as well throw this into a Hadoop cluster and implement some map-reduce jobs... BIG DATA FOR THE WIN!!! :-)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Section 1: Big Data & Analytics Benefits and Concepts

Explain Volume, Velocity, Variety and Veracity in relation to BD&A.
With emphasis on the following:

Volume (data at scale) is about rising volumes of data in all of your systems, which presents a challenge both for scaling those systems and for the integration points among them.

Experts predict that the volume of data in the world will grow to 25 Zettabytes in 2020.

Big Data & Analytics shifts the scale from analyzing subsets of data to analyzing all the data available.

Velocity (data in motion) is about ingesting and analyzing data in motion, opening up the opportunity to compete on speed to insight and speed to act.

Big Data & Analytics shifts from analyzing data after it has landed in a warehouse or mart to analyzing data in motion as it is generated, in real time.

Variety (data in many forms) is about managing many types of data, and understanding and analyzing them in their native form.

80% of all the data created daily is unstructured - videos, images, emails, and social media

Big Data & Analytics shifts from cleansing data before analysis to analyzing the information as is and cleansing only when needed.

Veracity (trustworthiness of data): as the complexity of big data rises, it becomes harder to establish veracity, which is essential for confident decisions.

According to the IBM GTO study in 2012, by 2015, 80% of all available data will be uncertain, and rising uncertainty means declining confidence.

Big Data & Analytics shifts from starting with a hypothesis and testing that against data to exploring all the data to identify correlations.

Define analytics and the various types
With emphasis on the following:

To solve a specific business problem, organizations will need to deploy the right type of analytics to suit each distinct situation.

Descriptive Analytics is the retrospective analysis that provides a rearview mirror view on the business through reporting on what happened and what is currently happening. Business Intelligence (BI) falls under this category.

Predictive Analytics (PA) is the forward-looking analysis that provides an organization with future-looking insights on the business by predicting what is likely to happen and why it's likely to happen.

Prescriptive Analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics suggests decision options on how to take advantage of a future opportunity or mitigate a future risk and shows the implication of each decision option. Prescriptive analytics can continually take in new data to re-predict and re-prescribe, thus automatically improving prediction accuracy and prescribing better decision options. Prescriptive analytics ingests hybrid data, a combination of structured (numbers, categories) and unstructured data (videos, images, sounds, texts), and business rules to predict what lies ahead and to prescribe how to take advantage of this predicted future without compromising other priorities.

Cognitive Analytics allows organizations to determine the best course of action by using cognitive computing systems that learn and interact naturally with people to extend what either humans or machines could do on their own. They help human experts make better decisions by penetrating the complexity of Big Data.

Describe the role of machine learning in BD&A - Extracting meaningful information from large data sets and processing those datasets in a reasonable amount of time is challenging. Traditionally, data analysis has been dominated by trial and error, and this approach becomes impossible when datasets are large and heterogeneous. Machine learning will enable cognitive systems to learn, reason and engage with us in a more natural and personalized way. These systems will get smarter and more customized through interactions with data, devices and people. They will help us take on what may have been seen as unsolvable problems by using all the information that surrounds us and bringing the right insight or suggestion to our fingertips right when it's most needed. Machine learning offers a solution to this problem by emphasizing real-time, highly scalable predictive analytics using fast and efficient algorithms for real-time processing of data. Examples of business use cases that benefit from machine learning techniques include churn prevention, customer segmentation, fraud detection and product recommendations.

Difference PA vs BI: PA provides more future-looking answers and recommendations to questions that cannot be addressed at all by BI. PA + BI delivers significantly higher returns than traditional BI implementations that are not predictive in nature.

Describe the value of analytics to support business decisions
With emphasis on the following:

Describe the role of interactive analysis and reporting

Business analytics enables organizations to turn information into better outcomes. To achieve better outcomes, decision makers need to make smarter decisions, based on having answers to the following: How are we doing? Why is the trend occurring? What should we be doing? Analytics empowers users at every level of the organization, in every role, to make better decisions. It comes down to a desire to make more business decisions based on actual facts rather than gut instinct. Business users need tools that allow them to sort data any way they like to derive additional insights.

- Putting sophisticated yet simple-to-use interactive analysis and reporting tools into the hands of every business user drives creativity and inspires a culture of evidence-based decision making, letting users derive additional insights with minimal reliance on IT.

- Self-service predictive analytics puts the power of predictive modeling in the hands of business users. Using predictive models, users identify patterns based on what has happened in the past, and use them to predict what is likely to happen in the future. For example, you can use a model to predict which customers are least likely to churn, or most likely to respond to a particular offer, based on characteristics such as income, age, and the organizations and memberships they subscribe to. The resulting predictions can be used to generate lists of target customers or cases of interest, as input for strategic planning, or can be integrated with rules in the context of a predictive application.
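To make the idea concrete, here is a minimal self-service churn-scoring sketch in Python, assuming scikit-learn and pandas are available; the column names and values are hypothetical and not taken from any specific product.

```python
# A minimal sketch, assuming scikit-learn and pandas; the column names and
# values below are hypothetical examples.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical customers with a known outcome (churned = 1 means the customer left).
history = pd.DataFrame({
    "income":      [52, 31, 78, 45, 23, 90],   # annual income, in thousands
    "age":         [34, 51, 29, 42, 60, 38],
    "memberships": [2, 1, 4, 2, 1, 5],
    "churned":     [0, 1, 0, 1, 1, 0],
})

features = ["income", "age", "memberships"]
model = LogisticRegression().fit(history[features], history["churned"])

# Score a new customer: the churn probability can feed a target list or be
# combined with business rules inside a predictive application.
new_customer = pd.DataFrame({"income": [40], "age": [48], "memberships": [1]})
print(model.predict_proba(new_customer)[:, 1])
```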

Explain how Big Data & Analytics are interlocked
With emphasis on the following:

Data: Data ranges from unstructured to structured. In general, data does not need to be processed by Extract, Transform, and Load (ETL) to be useful.

Variety: Data comes from multiple sources, across different domains. Care must be taken to avoid aligning data with specific applications early in the workflow, so the data can still be used in other applications.

Association: Correlation criteria range from simple to complex. Solutions will employ…

Provisioning: Resources are provisioned based on consumer demand. Expect IT to be responsible for allocation and initial data staging
- Benefits:
- Reduced IT cycle time and management cost.
- Extended range of sources to meet demand.
- Storage allocated and expanded to meet demand.
- Processing capacity added and removed to meet demand.
- Processing: Expand processing capacity to meet demand.
- Personas: IT and line of business users are jointly responsible for finding and initially preparing data.

Governance: Newly provisioned data must adhere to corporate and data governance policies. Mechanisms must be created to share policy decisions and to track compliance.

Analytics: Services and tools that augment data sources define the analytics that add value and understanding to new data.

Schema: Data is generally organized in columns so it is easily consumed by analytic tools. Unstructured data is also used, but is restricted to search and to MPP tools like Hadoop. Data should not necessarily be normalized for use in a specific application, but provisioned for more general access and analysis.

Visualizations: Data is often highlighted using advanced visualizations and graphics where traditional charts do not offer enough insight into overall dataset.

Accuracy: Provide results that are relevant and statistically sound. Tools must highlight areas of concern where assumptions may be incomplete or statistically invalid.

Range of users: Tools must allow business users to use analytics and statistics without insisting that these users be domain experts.

Compatibility: Results that are compliant with existing reporting and application frameworks.

Persuade: Use results to prove a hypothesis and to persuade.

Related Information : Extend knowledge to related concepts and other sources of data.

Explain the various data preparation processes
With emphasis on the following:

Explain methods used to transform the data for analytics

One of the main functions of an ETL tool is to transform structured data. The transformation step is the most vital stage of building a structured data warehouse. Here are the major transformation types in ETL:

- Format revision. Fields can contain numeric and text data types. If they do, you need to standardize and change the data type to text to provide values that could be correctly perceived by the users. The length of fields can also be different and you can standardize it.

- Decoding of fields. In multiple source systems, the same data items are described by a variety of field values. Also, many legacy systems are notorious for using cryptic codes to represent business values. This ETL transformation type changes codes into values that make sense to the end-users.

- Calculated and derived values. Sometimes you have to calculate the total cost and the profit margin before data can be stored in the data warehouse, which is an example of the calculated value. You may also want to store customer's age separately–that would be an example of the derived value.

- Splitting single fields. The first name, middle name, and last name, as well as some other values, were stored as a large text in a single field in the earlier legacy systems. You need to store individual components of names and addresses in separate fields in your data repository to improve the operating performance by indexing and analyzing individual components.

- Merging of information. This type of data transformation in ETL does not literally mean the merging of several fields to create a single field. In this case, merging of information stands for establishing the connection between different fields, such as product price, description, package types, and viewing these fields as a single entity. 
- Character set conversion
- Unit of measurement conversion
- Date/Time conversion
- Summarization
- Key restructuring
- De-duplication

Transformation of structured data also varies by where it occurs. Data can be transformed in the source system before it is moved, in an ETL engine, or in the target system after it lands (ELT).

Some type of feature extraction must be applied to unstructured data to convert it to structured data before applying the above kinds of transformations (a small sketch of a few of these transformations follows).
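As a rough illustration of a few of the transformation types listed above (decoding fields, splitting a name, calculated and derived values, de-duplication), here is a small pandas sketch; the column names and code table are made up for the example and are not tied to any ETL product.

```python
# A small sketch of common ETL transformations using pandas (hypothetical
# column names and lookup table).
import pandas as pd

source = pd.DataFrame({
    "customer_name": ["Ana Maria Souza", "John P. Smith"],
    "country_code":  ["BR", "US"],                    # cryptic legacy codes
    "unit_price":    [10.0, 12.5],
    "quantity":      [3, 2],
    "birth_date":    pd.to_datetime(["1980-05-01", "1975-11-20"]),
})

# Decoding of fields: translate codes into values users understand.
country_names = {"BR": "Brazil", "US": "United States"}
source["country"] = source["country_code"].map(country_names)

# Splitting single fields: break a full name into components.
parts = source["customer_name"].str.split()
source["first_name"] = parts.str[0]
source["last_name"] = parts.str[-1]

# Calculated and derived values: total cost and a derived age.
source["total_cost"] = source["unit_price"] * source["quantity"]
source["age"] = (pd.Timestamp("2015-06-01") - source["birth_date"]).dt.days // 365

# De-duplication before loading into the warehouse.
clean = source.drop_duplicates(subset=["first_name", "last_name"])
print(clean[["first_name", "last_name", "country", "total_cost", "age"]])
```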

Explain methods used to clean the data for analytics.

Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate data from a data set.

Data validation typically applies data cleansing methods interactively as data is being entered. Other forms of data cleansing are typically performed in larger batches.

Care must be taken when modifying data to be used for analytics that the modifications do not affect the analytic results. Enough cleansing must be done to permit analytics, but less change is better.

Error detection may be strict or fuzzy. Strict detection may match against permitted legal values, or apply simple algorithms like regular expressions. Fuzzy detection applies statistical techniques to identify errors. Duplicates may be detected within a data set using strict or fuzzy matching.

Error correction may include standardization of values, converting formats, enhancing data by merging with additional data, filtering out incomplete records, and merging duplicate records.

High-quality data needs to pass a set of quality criteria. Those include:
- Validity: The degree to which the measures conform to defined business rules or constraints (a small validation sketch follows this list). Data constraints fall into the following categories:

- Data-Type Constraints - e.g., values in a particular column must be of a particular datatype, e.g., boolean, numeric (integer or real), date, etc.

- Range Constraints: typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.

- Mandatory Constraints: Certain columns cannot be empty.

- Unique Constraints: A field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.

- Set-Membership constraints: The values for a column come from a set of discrete values or codes. For example, a person's gender may be Female, Male or Unknown (not recorded).

- Foreign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the "state" column is required to belong to one of the US's defined states or territories: the set of permissible states/territories is recorded in a separate States table. The term foreign key is borrowed from relational database terminology.

- Regular expression patterns: Occasionally, text fields will have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.

- Cross-field validation: Certain conditions that utilize multiple fields must hold. For example, in laboratory medicine, the sum of the components of the differential white blood cell count must be equal to 100 (since they are all percentages). In a hospital database, a patient's date of discharge from hospital cannot be earlier than the date of admission.

- Decleansing is detecting errors and syntactically removing them for better programming.

- Accuracy: The degree of conformity of a measure to a standard or a true value.

- Completeness: The degree to which all required measures are known. 

- Consistency: The degree to which a set of measures are equivalent across systems.

- Uniformity: The degree to which a set of data measures are specified using the same units of measure in all systems.
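The validation sketch referenced above, in Python: a handful of assumed, illustrative rules covering mandatory, range, regular-expression and cross-field constraints.

```python
# A minimal validation sketch; the records and rules are invented for illustration.
import re
from datetime import date

patients = [
    {"name": "Maria", "age": 42,  "phone": "(555) 123-4567",
     "admitted": date(2015, 5, 1),  "discharged": date(2015, 5, 7)},
    {"name": "",      "age": 210, "phone": "555-1234",
     "admitted": date(2015, 5, 10), "discharged": date(2015, 5, 2)},
]

PHONE = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")

def validate(rec):
    errors = []
    if not rec["name"]:                          # mandatory constraint
        errors.append("name is required")
    if not 0 <= rec["age"] <= 130:               # range constraint
        errors.append("age out of range")
    if not PHONE.match(rec["phone"]):            # regular expression pattern
        errors.append("phone format invalid")
    if rec["discharged"] < rec["admitted"]:      # cross-field validation
        errors.append("discharge date before admission")
    return errors

for rec in patients:
    print(rec["name"] or "<missing>", validate(rec))
```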

Data Preparation: Data is prepared to balance flexibility, timeliness, accuracy and size. ETL and Big Data are often driven by different requirements.
- ETL
- Volume: Dataset size is driven by application requirements. Datamarts contain summary or snapshot data.
- Velocity: Data is staged until it can be filtered and added to the appropriate data warehouse.
- Variety: Key structures normalized to follow First Normal Form (1NF) to Third Normal Form (3NF) schema
- Veracity: Data is trustworthy within the domain of the application for which it is designed.
- Big Data
- Volume: Dataset size bounded by technology and cost constraints. Virtual datamarts contain dynamic information.
- Velocity: Data is captured for both real time and post processing as required.
- Variety: Key structures are denormalized. Data is generally inconsistent when viewed across application domains.
- Veracity: Data is trustworthy when it shown to be consistent and accurate.

Describe the trade-off of transforming, cleaning or aligning of data for analytics

Tradeoffs: Data becomes more accurate within the domain of an application as it is processed and cleaned. At the same time it can become less flexible and less accurate in other applications. Data is more readily consumed by analytics when it is formatted in columns.
- Variety: Data comes from multiple sources, across different domains.
- Generality: Less processed data potentially is more flexible but less consistent.
- Accuracy: Highly processed data is potentially more accurate in predefined application domains; less so across domains.
- Conformance: Data that is aligned with existing reporting structures is typically easier to use and more relevant. Data alignment can reduce consumability in different domains.
- Consistency: Removal of duplicate and missing values increases accuracy within targeted application domain.
- Schema and Cleaning: Data structures in 1NF to 3NF increase relevance within targeted application domain.

Describe the four steps for preparing data that is "good enough" for analytics

These are "Data Wrangling" steps for a data reservoir:
- Provision storage. Consider cloud for elastic environments. 
- Create columns that classify and distinguish incoming data. Use date/time and geolocation when nothing else fits.
- Clean potential key columns so they have the most consistent values possible across the largest number of domains (non-trivial). Remove duplicates and normalize aliases to achieve something close to 1NF.
- Apply analytics to add value to the source data set. Update schema and global catalog where appropriate.

Describe the need and use for geospatial and temporal filtering in BD preparation.

Filtering by an interval of time can reduce the amount of data to be considered for analytic purposes.

Filtering by a geospatial region can reduce the amount of data to be considered for analytic purposes.

These 2 filters are commonly applied as a first step to reduce large data volumes.
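A simple sketch of applying these two filters as a first data-reduction step, assuming pandas; the event data and the bounding box are invented for the example.

```python
# Temporal and geospatial filtering as a first data-reduction step.
import pandas as pd

events = pd.DataFrame({
    "ts":  pd.to_datetime(["2015-06-01 09:00", "2015-06-01 23:30",
                           "2015-06-02 10:15", "2015-06-03 08:45"]),
    "lat": [-15.79, -23.55, -15.80, -22.90],
    "lon": [-47.88, -46.63, -47.86, -43.17],
})

# Temporal filter: keep only an interval of interest.
in_window = events[(events["ts"] >= "2015-06-01") & (events["ts"] < "2015-06-02")]

# Geospatial filter: keep only points inside an (assumed) bounding box.
bbox = in_window[in_window["lat"].between(-16.1, -15.5) &
                 in_window["lon"].between(-48.3, -47.3)]

print(len(events), "events ->", len(bbox), "after temporal + spatial filtering")
```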

Explain specific data preparation techniques for creating structured data
With emphasis on the following:

Define text analytics

Text analytics refers to the process of deriving high-quality information from text. Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.

Describe the benefits of text analytics

Text analytics applies analytic techniques to unstructured text as may be found in documents or within text fields in structured data

Text analytics on survey text lets you transform unstructured survey text into quantitative data to gain insight using sentiment analysis.

Text analytics on social media content can be used to identify buzz (level of interest in a topic) and sentiment (positive or negative feelings toward a topic). Social media content can also augment customer profiles.

Social media analytics can be used to predict customer behavior and create targeted marketing campaigns

Text analytics can be used to filter or categorize documents, such as sorting email into folders or identifying spam

NLP can provide the following benefits from text analytics:

- Provides an easy to use environment for capturing the knowledge of your business domain experts into dictionaries and semantic rules for re-use. 

- Allows customizable Information Extraction for logical reasoning to draw inferences from natural, unstructured communications. 

- Offers Entity & Relationship Recognition to classify words or phrases into categories that can be analyzed for business meaning.

Explain how to extract features from unstructured data to provide input for analytics

Analytic tools and algorithms require input to be presented in specific statistical data types: binary (yes/no or true/false), categorical (arbitrary labels like blood type or gender), ordinal (relative score), binomial (number of successes out of possible), count (number of items in a given space or time), and real numbers.

Feature extraction is any process for converting unstructured data into statistical data types. Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately. Some examples:

- Feature extraction on text includes word stemming to standardize words into root words, bag of words to remove sequence information and compare documents in a vector space model, counting word occurrences, regular expressions for detecting simple patterns in sequences, search indexing, and phonetic indexing algorithms like Soundex or NYSIIS (see the sketch after this list).

- Feature extraction on image data includes histograms, edge detection, blob detection, template matching, motion detection, optical flow, and optical character recognition.

- Feature extraction on machine log data may include pattern recognition, standardization to a common format, classification, correlation (collecting messages from different systems that belong to a single event), and various filtering techniques.
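The sketch referenced in the text-feature item above: a bag-of-words extraction with scikit-learn (version 1.0 or later assumed), turning raw text into count vectors that downstream analytics can consume.

```python
# A small feature-extraction sketch: unstructured text -> bag-of-words matrix.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The service was great, I loved the support team",
    "Terrible support, I will cancel the service",
    "Great product and great support",
]

vectorizer = CountVectorizer(stop_words="english")   # drop very common words
counts = vectorizer.fit_transform(documents)         # documents x vocabulary matrix

# Each document is now a vector of word counts; sequence information is gone,
# but the vectors can feed clustering, classification or sentiment models.
print(vectorizer.get_feature_names_out())            # requires scikit-learn >= 1.0
print(counts.toarray())
```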

Describe the general sources of data for BD&A
With emphasis on the following:

Big data requires a new view on business intelligence, data management, governance, and delivery. Traditional IT landscapes are becoming too expensive to scale and maintain in order to meet the continuous exponential growth of data, both structured and unstructured. Data truly is driving the need for innovative and more cost effective solutions.

The availability of new and more data opens up opportunities for gaining more in depth and accurate insights on operations, customers, weather, traffic, and so on. It is important to keep in mind that Big Data & Analytic solutions deal not only with emerging data (or new data sources). A truly cohesive solution considers traditional data as well as this emerging data, often with an objective of having each enhance the understanding of the other.

Data volumes and data generation speeds are significantly higher than they have been before. All these new kinds of data require a new set of technologies to store, process, and make sense of the data. About 80% of the data available to an organization is unstructured, and there are many new types of data coming from sources such as social media posts (Twitter, Facebook), surveillance cameras, digital pictures, call center logs, climate data, and many others. There are also other types of structured data from sensor devices, smart meters, click-stream data and others.

Big data is typically identified as having one or more of the following attributes: volume, variety, velocity, and veracity. Typical types of source data can be broken down into the following categories, each of which may demonstrate any of the four V's.
- Machine and sensor data: An increasingly sensor-enabled, instrumented, and connected world generates huge volumes of data with machine-speed characteristics; this covers all data generated by machines. Examples: Radio Frequency ID (RFID) data; data generated by servers, applications, networks, etc.
- Image and video: Digital images and videos. Examples: security surveillance cameras, smartphones, web images/videos.
- Enterprise content: An organization's documents, and other content, that relate to the organization's processes. Examples: documents, forms, checks.
- Transaction and application data: Typically structured data that describes an event, generally recognized as a traditional source. Examples: point-of-sale transaction data, data entered by a user via a web form, data entered into a CRM application.
- Social data: An expression of social media (user-generated content on the internet) in a computer-readable format that also includes metadata providing context (location, engagement, links); focused strictly on publicly available data. Examples: Twitter, Facebook, YouTube.
- Third-party data: Data obtained under license from third-party organizations, typically structured and recognized as a traditional source. Examples: Experian, D&B.

Section 2: Big Data & Analytics Design Principles

Explain when it is appropriate to use Hadoop to support the BD&A use case
With emphasis on the following:

Exploration, Landing and Archive - Big Data Repositories:

A new and economical generation of technology has emerged to enable organizations to explore, apply analytics to, and extract value from any kind of data. The open source Hadoop became the technology of choice to store all the data (big data).

There are some limitations or drawbacks to a Hadoop database: it often compromises consistency in favor of availability, and the current offerings lack full ACID transaction support. Also, as the tools are still evolving, some complex coding is needed to perform certain types of data processing, but the platform offers many benefits that enable organizations to deliver analytics more efficiently.

Because Hadoop is based on a massively parallel processing (MPP) architecture, it is possible to scale a Hadoop cluster to hundreds (and even thousands) of nodes. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing. In addition to cost and scalability, Hadoop also brings other benefits such as high availability and data redundancy.
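As a toy illustration of the map and reduce functions mentioned above, here is a plain-Python word count over two "blocks" of text; it only mimics the programming model and is not an actual Hadoop job.

```python
# A toy illustration of the map/reduce programming model (plain Python only):
# each block is mapped independently, then intermediate results are grouped
# by key and reduced.
from collections import defaultdict

blocks = [
    "big data and analytics",
    "analytics on big data",
]

def map_phase(block):
    # Emit (word, 1) pairs; in Hadoop this would run in parallel on each HDFS block.
    return [(word, 1) for word in block.split()]

# Shuffle: group intermediate pairs by key.
grouped = defaultdict(list)
for block in blocks:
    for word, count in map_phase(block):
        grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 2, 'data': 2, 'and': 1, 'analytics': 2, 'on': 1}
```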

Landing

Landing refers to a landing or provisioning area. In the case of Big Data & Analytics this landing area may support data that is not yet modeled and/or unstructured data. A key consideration is to utilize a cost efficient platform. Data can be landed in its original format (raw state) and leveraged both for a historical record of its original state and to support exploration activities.

Open source Hadoop and commodity hardware became the preferred choice, as it can scale as needed and offers a cost-effective way to store and process large amounts of structured and unstructured data. A Hadoop database provides a mechanism for storage and retrieval of data that is not necessarily modeled in the traditional tabular relations used in relational databases.

One of the key benefits of the Hadoop system is the fact that data can be stored in the Hadoop Distributed File System (HDFS) before a schema is defined. This makes it very simple for organizations to collect any kind of data (structured or unstructured) and land it in a Hadoop system for further investigation and processing. After the data is loaded, a schema can be applied, providing SQL-like capabilities that allow data interaction and exploration using traditional BI and reporting tools.

With the evolution of data integration tools, the Hadoop platform has also become an attractive, cost-effective platform to support part of the data transformation process. This allows organizations to offload such workloads from expensive platforms such as the data warehouse.

Exploration

Exploration is one of the key elements of the next-generation architecture for delivering information and insights.

Exploration refers to a system supporting exploratory querying and analytics of detailed data (not aggregated), often in the raw form. While it may have a SQL based interface, it is not restricted to structured data. Activities supported here would include clustering, segmentation, forecasting, predictive analytics, and others that require large amounts of data. And typical users would include business analytic users and/or data scientist.

In this new architecture the exploration repository has much more flexibility and can provide much more agility for end users to quickly obtain insights on new sets of information that were not available in the data warehouse and data marts. New big data and analytics tools allow users to access structured and unstructured data for exploratory analytics and to identify interesting correlations and insights, in particular by leveraging the detailed data to support exploratory analysis while pushing down as much of the processing as possible to execute directly in this area, without needing to duplicate and move the data elsewhere for exploratory use.

Typical tools used are R, SAS, and SPSS. The data here may include replicated data pulled from operational systems, third party and public sources. Exploration activities imply that the answer or the question is not typically known. Thus, a modeled environment is ill suited for these activities. Rather, a schema-less environment allows users to access data in its original form quickly, manipulate it in various ways, and use all available data.

The cost and flexibility of this platform make it ideal for storing a longer history of data (data that typically was stored in the warehouse and data marts). Basically, the Hadoop database becomes a system-of-record area and fulfills the needs of the power users (data scientists) who need to perform exploratory analytics on deep history (several years) and raw data.

Archive

While the need to archive "cold" data from data warehousing environments (to reduce costs and improve performance) is as necessary as ever, customers also require the ability to query this data. Similarly, while this data may be of no interest for operational reporting or business intelligence, it may still be relevant for a small set of users performing exploratory or deep analytics.

As the Hadoop infrastructure provides a more cost-effective platform, data that no longer needs to be active can be archived there for legal reasons. This allows the data to be accessed in the future without having to procure new hardware to restore it from other magnetic devices.

Explain when it is appropriate to use data streaming to support the BD&A use case
With emphasis on the following:

When streaming data is available, it can either be analyzed in real time as it arrives or stored for later analysis.

Analyzing streaming data in real time can reduce the time to deliver results.

Analyzing streaming data in real time can reduce the storage required by eliminating the need to collect data in a repository before analyzing it.

Techniques that can be applied to streaming data include summarization by creating metadata from unstructured data, aggregating data to create averages, filtering and other data reduction techniques.
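A minimal sketch of those techniques on a stream: a sliding-window average with a threshold alert applied to readings as they arrive, written in plain Python; a real deployment would use a stream-processing engine rather than a generator, and the readings here are invented.

```python
# Analyzing data in motion: a sliding-window average plus a threshold alert,
# applied to readings as they arrive (illustrative only).
from collections import deque

def sensor_stream():
    # Stand-in for readings arriving from a sensor or message queue.
    for value in [21.0, 21.4, 22.1, 35.7, 22.0, 21.8]:
        yield value

WINDOW = 3
window = deque(maxlen=WINDOW)

for reading in sensor_stream():
    window.append(reading)
    moving_avg = sum(window) / len(window)
    # Filtering / alerting happens before anything is ever written to disk.
    if reading > moving_avg * 1.3:
        print(f"alert: reading {reading} well above moving average {moving_avg:.1f}")
```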

Streaming data from multiple sources can be correlated and merged into a single stream.

Streaming data includes audio data, video data, sensor data, geospatial data, telemetry, telematics, and machine log data. The internet of things is generating an increasing volume of streaming data.

Sample use cases are:

Real time fraud detection, cyber security, telematics, network load optimization, real time contextual marketing campaign

Explain when it is appropriate to leverage data streaming and Hadoop for data integration Extract, Transform, and Load (ETL).
With emphasis on the following:

Hadoop is designed to use low-cost commodity hardware to provide a data repository. Using Hadoop as a landing zone for streaming data can provide a cost effective repository for collecting streaming data for later exploration and analysis.

Other options for landing streaming data include relational database or file storage systems.

Describe when to use analytics on data-in-motion vs analytics on data-at rest to support a business use case
With emphasis on the following:

Analytics on Data at Rest

Data at rest consists of physically stored data that is considered more static. It supports analytics for use cases where decisions do not need to be made in real time.

Data at rest provides historical data for analysis.

Analytics on data-in-motion - Real-Time Analytical Processing

Data in motion consists of leveraging high-speed, highly scalable computing systems to perform analytics on streaming data that is temporarily persisted in memory.

Applying analytics to data in motion supports real time decision-making, adapting to business environments, and helping customers when it really matters - now. It involves ingesting data at varying volumes and velocities, classifying and filtering that data, applying appropriate analytics to it, and providing the appropriate information immediately so that proper action may be taken (either by a person or automatically). Data may be coming from social media, machine data, log data, or other.

The analytics applied in real time are the same models that can be applied to batch data analysis. The models are based on batch data collected. Streaming data eventually will be (perhaps) landed to disk and be able to be used to update or fine tune models. This creates a continuous learning cycle. With each new interaction an organization learns more about their customer that can then be applied to future interactions. In this way, real-time processing works in coordination with the broader big data and analytics architecture. It acts as a method of ingesting data to persist in the environment. It acts as a method to apply analytics in the moment for immediate results.

Real-time analytical processing (RTAP) analyzes massive data volumes quickly (in real time) and turns data into insight that can be used to make better decisions. Data can be quickly ingested, analyzed, and correlated as it arrives from thousands of real-time sources (sensor devices, machine data, call detail records, etc.). Insights resulting from this processing are turned into actions for automated decision management.

There are many applications of RTAP in almost any industry; here are just a few examples to illustrate the benefits of real-time processing and analytics.

Alerting solutions:

The RTAP application notifies the user(s) that the analysis has identified that a situation (based on a set of rules or process models) has occurred and provides options and recommendations for appropriate actions. Alerts are useful in situations where the application should not be automatically modifying the process or automatically taking action. They are also effective in situations where the action to be taken is outside the scope of influence of the RTAP application.

Example of alerting application: A patient monitoring application would alert a nurse to take a particular action, such as administering additional medicine.

Feedback applications:

The RTAP application identifies that a situation (based on a set of rules or process models) has occurred and makes the appropriate modifications to the processes to prevent further problems or to correct the problems that have already occurred.

Feedback analysis is useful in, for example, manufacturing scenarios where the application has determined that defective items have been produced and takes action to modify components to prevent further defects.

As an example: A manufacturer of plastic containers might run an application that uses the data from sensors on the production line to check the quality of the items through the manufacturing cycle. If defective items are sensed, the application generates instructions to the blending devices to adjust the ingredients to prevent further defects from occurring.

Detecting system or application failures :

The RTAP application is designed to notice when a data source does not respond or generate data in a prescribed period of time.

Failure detection is useful in determining system failure in remote locations or problems in communication networks.

Examples: An administrator for a critical communications network deploys an application to continuously test that the network is delivering an adequate response time. When the application determines that the speed drops below a certain level or the network is not responding at all, it alerts the administrator.

Use the CAP theorem to choose an optimal data storage technology.
With emphasis on the following:

Brewer's CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it was successful or failed)

Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

Relational database systems provide consistency and availability.

NoSQL systems like Cloudant and CouchDB provide availability and partition tolerance.

NoSQL systems like Hadoop HBase, Google BigTable, and MongoDB provide consistency and partition tolerance.

Describe the considerations of security on BD&A.
With emphasis on the following:

Explain what are the common focused areas across a security framework

People: Manage and extend enterprise identity context across all security domains with end-to-end Identity Intelligence.

Data: Enterprise-wide solutions for assuring the privacy and Integrity of trusted Information in your data center

Applications: Reducing the costs of developing secure applications and assuring the privacy and Integrity of trusted information.

Infrastructure:
-Endpoint and Server: Ensuring endpoints, servers, and mobile devices remain compliant, updated, and protected against today's threats.
-Network: Guard against sophisticated attacks using an Advanced Threat Protection Platform with insight into users, content and applications.

Security Intelligence and Analytics: Helping our customers optimize security with additional context, automation and integration.

Describe the data types in the context of data security and compliance

When talking about data we distinguish between:

Static Data - data at rest, on disks, tapes and in data repositories (databases, data warehouses, etc)

Dynamic Data - data in motion, as it is being extracted/used by individuals or applications

Meta Data - Data about data, configuration, the settings and vulnerability of the repository itself.

Describe dynamic data security best practices

Create a secure, detailed, verifiable audit trail of all database activities
-User activity, including privileged users
-User creation and object creation and manipulation

Gain visibility into all database activity involving sensitive data
-Who, what, when and how
-Real-time alerts for suspicious activity

Integrate with business processes for audit compliance
-Dissemination of reports to appropriate personnel for signoff and review
-Retain reports and signoffs per audit requirements

Cross-platform, common solution for the enterprise.

Describe Static Data security best practices

All data sources potentially contain sensitive information

Data is distributed as needed throughout the cluster by the Big Data application

Deploy encryption agents to all systems hosting Data Stores

Agents protect the data store at the file system or volume level

Big Data governance automation solutions need to support security and privacy in a manner optimal to the speed and quality objectives of the organization.

The solutions should integrate with corporate glossary, data definitions, and blueprints to align with broader corporate governance processes to ensure accuracy.

The solution should be able to explore and profile data, explore its lineage and relationships, and classify sensitive data.

Data Architects need to De-identify sensitive data within the warehouse and apply obfuscation techniques to both structured and unstructured data while maintaining effective alignment with their analytics objectives.

IT security needs to monitor the warehouses and gain real-time alerts with centralized reporting of audit data while preventing data breaches.

IT needs the system to automate change management in response to policy, resource and environmental changes.

Enterprise firms need this to scale to large platforms like System z.

Security officers will need the system to integrate with their SIEM platform for a cohesive enterprise dashboard.

Describe the primary prerequisites for predictive analytics.
With emphasis on the following:
List prerequisites for predictive analytics:

Data sufficient to train a predictive model for the predictive goal

A business requirement for predictive analytics, such as one of the business applications listed in this article (i.e., a way a predictive model can and will be used, rather than just being a nifty model that may not provide business value), and management buy-in for the integration and deployment of predictive scores

Buy-in from the business user community for the predictive analytics initiative

Management sponsorship

Historic data that adequately captures problem domain

Propose alternate outcomes and determine rules that model associated behaviors.

Explain the role of SQL for BD&A
With emphasis on the following:

SQL on Hadoop to support Big Data & Analytics Applications

Hive
- Hive is a data warehouse solution that has a thin SQL-like querying language called HiveQL. This language is used for querying data, and it saves you from writing native MR processing to get your data out. Since you already know SQL, Hive is a good solution, as it enables you to take advantage of your SQL knowledge to get data in and out of Apache Hadoop. One limitation of the Hive approach, though, is that it makes use of the append-only nature of HDFS to provide storage. This means that it is phenomenally easy to get the data in, but you cannot update it. Hive is not a database but a data warehouse with convenient SQL querying built on top of it. Despite the convenient interface, the query time required to process requests, particularly on very large datasets, is long enough that jobs are submitted and the results accessed when available; the information is not interactively available.

HBASE
- HBase, by comparison, is a key-value (NoSQL) data store that enables you to write, update, and read data randomly, just like any other database. But it's not SQL. HBase enables you to make use of Hadoop in a more traditional real-time fashion than would normally be possible with the Hadoop architecture. Processing and querying data is more complex with HBase, but you can combine the HBase structure with Hive to get an SQL-like interface. HBase can be really practical as part of a solution that adds the data, processes it, summarizes it through MR, and stores the output for use in future processing.

In short, think of Hive as an append-only SQL database and HBase as a more typical read-write NoSQL data store. Hive is useful for SQL integration if you want to store long-term data to be processed, summarized and loaded back. Hive's major limitation is query speed: when dealing with billions of rows, there is no live querying of the data that would be fast enough for any interactive interface. For example, with data logging, the quantities of data can be huge, but what you often need is quick, flexible querying on either summarized or extreme data (i.e., faults and failures). HBase is useful when you want to store large volumes of flexible data and query that information, but you might want only smaller datasets to work with. Hence, you might export data that simultaneously: needs to be kept "whole," such as sales or financial data; may change over time; and also needs to be queried. HBase can then be combined with traditional SQL or Hive to allow snapshots, ranges, or aggregate data to be queried.
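For a feel of the Hive side of this comparison, here is a minimal sketch assuming a reachable HiveServer2 endpoint and the optional `pyhive` package; the host name and the `weblogs` table are hypothetical.

```python
# A minimal HiveQL-from-Python sketch (assumes HiveServer2 and `pyhive`;
# the host and table names are invented).
from pyhive import hive

conn = hive.connect(host="hadoop-edge-node", port=10000)
cur = conn.cursor()

# HiveQL looks like SQL, but it compiles to batch jobs over append-only HDFS
# data, so results are fetched when the job finishes rather than interactively.
cur.execute("""
    SELECT status, COUNT(*) AS hits
    FROM weblogs
    WHERE dt = '2015-06-01'
    GROUP BY status
""")
for status, hits in cur.fetchall():
    print(status, hits)
```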

SQL on Relational Databases (RDBMS) to support Big Data & Analytics

SQL on Relational Databases is more appropriate for transactional and Operational Analytical Workloads.

Transactional Applications:
- Workloads that drive a lot of inserts, updates and deletes
- Workloads that generate random access on disk 
- Workloads that need to guarantee transaction integrity.
- Workloads with large concurrency (hundreds or thousands of transactions per second).

Operational Analytical:
- Analytical workloads in which high query performance is a key requirement.
- Analytical workloads that require the transactional characteristics of a relational database (transaction integrity, high concurrency levels, etc.).

Compare columnar and row oriented storage for RDBMS .
With emphasis on the following:

A columnar database stores data in columns instead of rows. The major differences between traditional row-oriented databases and column-oriented databases lie in performance, storage requirements, and schema modification.

Columnar databases are a great option when your table has many columns and your queries need only a small number of them. A columnar database is designed to write and read data from disk efficiently, which can speed up query response times. One of the major benefits of a columnar database is that it compresses data very well, which makes operations very fast. A column-oriented database is also largely self-indexing, so it uses less disk space than a row-oriented system. The columnar approach is especially important in the data warehousing domain, which deals with large volumes of data: large amounts of complex data are loaded, transformed and accumulated, which a column-oriented system handles easily.

Row-oriented storage is ideal when many columns of a single row are required at the same time and the row size is relatively small, as the entire row can be retrieved with a single disk read. Row-oriented databases are well suited for OLTP workloads, which are dominated by interactive transactions.
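A toy illustration of the difference, in plain Python: summing one column touches every full record in a row layout, but only a single contiguous array in a columnar layout (real storage engines are, of course, far more sophisticated than this).

```python
# Row layout vs. column layout for an analytic scan over one column.
rows = [
    {"id": 1, "region": "BR", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 12.5},
    {"id": 3, "region": "BR", "amount": 7.3},
]

# Row-oriented: every full record is read even though only `amount` is needed.
total_row_oriented = sum(r["amount"] for r in rows)

# Column-oriented: the same data stored as one array per column.
columns = {
    "id":     [1, 2, 3],
    "region": ["BR", "US", "BR"],
    "amount": [10.0, 12.5, 7.3],
}
total_columnar = sum(columns["amount"])   # only the needed column is scanned

assert total_row_oriented == total_columnar
print(total_columnar)
```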

Describe when to leverage "Information Virtualization" on BD&A
With emphasis on the following:

Explain what is "Information Virtualization".

Information virtualization provides views over stored information that has been adapted to suit the needs of systems and processes that use the information. Information virtualization enables information to be stored in various locations but managed, viewed, and used as a single collection. It maximizes the sharing of information, while ensuring individual teams receive information that is customized to their needs.

Information virtualization has two layers of services:
- Information delivery: Information Locator, Search & Navigation, User Interfaces & Reports, Information Services & application program interfaces(APIs).
- Information provisioning: Cache, Consolidation, Federation, Replication.

Describe the information virtualization capabilities.

Information delivery.
- Consumer-focused access points include user interfaces, services, and APIs that provide information to both systems and people. This information can be the business information itself or descriptions and statistics about the business information, which is called metadata. Metadata is used to locate the correct information to use.

Information provisioning.
- Authoritative information sources are made available to the access points using the most appropriate provisioning mechanisms. The provisioning mechanisms can be the following items: 

- Caching provides read-only, in-memory local copies of data that provide fast access for situations where the same query is issued repeatedly on slowly changing information. 

- Federation is real-time extraction and merging of information from a selection of sources providing the most current (but not necessarily most consistent) information. 

- Consolidation makes use of a pre-constructed source of information fed from multiple sources. This approach provides consistent and complete local information (although it might not be the latest available). 

- Replication is an exact local copy of remotely sourced information, which provides locally stored, read-only access to information that is maintained elsewhere. 

- The choice of provisioning method is determined by the needs of the systems and processes that use the information. 

Describe the various ways to index Big Data.
With emphasis on the following:

Search: Search provides access to unmodeled structured and unstructured data using keywords.

Variety: Data comes from multiple sources, across different domains.

Ease of Use: Data related to keywords is easily found using familiar search interface.

Association: Keywords and results can be associated to metadata that provides easy association with enterprise reporting and analytics.

Indexing: Search indexes enable a search engine to retrieve ordered results in a fraction of the time needed to access the original content with each query.

Indexing: Full-text indexes provide access to large amounts of search data with results provided in milliseconds
- Indexing structured data in columns is the basis for faceted result groupings.
- Indexing unstructured data is the basis for keyword clustering and grouping.
- Indexing structured and unstructured data can enable better search queries by augmenting unstructured keywords with related structured metadata.

Results: Search engines provide results by accessing the index that was built by crawling and processing source content.

Filtering: Faceted results based on structured data indexing facilitate refinement and filtering.

Exploration: Keyword cluster results based on unstructured data facilitate exploration and discovery of related information.
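As a small illustration of these points in SQL Server terms (a hedged sketch, assuming the Full-Text Search feature is installed and using hypothetical table and catalog names), a full-text index gives keyword access to the unstructured column, while grouping on a structured column produces a faceted view of the results:

-- Hypothetical documents table: Body is unstructured text, Category is structured metadata.
CREATE TABLE dbo.Articles
(
    ArticleID INT IDENTITY    NOT NULL,
    Category  VARCHAR(50)     NOT NULL,
    Body      NVARCHAR(MAX)   NOT NULL,
    CONSTRAINT PK_Articles PRIMARY KEY (ArticleID)
);
GO

-- Full-text index: enables keyword search over the unstructured column.
CREATE FULLTEXT CATALOG SearchCatalog;
CREATE FULLTEXT INDEX ON dbo.Articles (Body) KEY INDEX PK_Articles ON SearchCatalog;
GO

-- Keyword search with a faceted result: hit counts per Category.
SELECT Category, COUNT(*) AS Hits
FROM dbo.Articles
WHERE CONTAINS(Body, '"big data"')
GROUP BY Category
ORDER BY Hits DESC;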

Describe the benefits of in-memory database to support BD&A solutions.
With emphasis on the following:

In-memory databases provided faster response times than traditional disk-based RDBMSs for applications that required extreme performance. However, the drawback was that the data set had to fit entirely in memory, as most of these in-memory products had no ability to store inactive parts of tables on disk the way traditional RDBMS vendors did. Therefore, these in-memory databases tended to be used for small transaction-processing applications or small data marts rather than data warehouses, as the data set size and memory requirements did not lend themselves to the size and data volumes of larger data warehouse environments.

Analytic queries tend to access only a portion of a table's columns, rather than the entire row of data. On a traditional row-based table this results in many more I/O requests, because the database must read full rows of data from disk to access the desired columns. END RESULT: poor application performance.

In-memory access to data is much faster than disk I/O, and although today's servers allow large memory configurations for database buffer caches, having to bring entire rows of data into memory (to access only a few columns) wastes memory and is not as efficient as a columnar table for analytic queries. END RESULT: server resources (memory and cores) are used inefficiently and query performance is not optimized (better than with smaller amounts of memory, but still penalized by unneeded columns occupying the database buffer cache).

Big data volumes are increasing faster than memory costs are dropping.

In-memory databases can support a larger number of concurrent users. Responsiveness is higher because the data is resident in memory.
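To illustrate the columnar argument with something concrete (a minimal sketch using SQL Server's columnstore feature in recent versions, with hypothetical table names), an analytic aggregate over a columnstore reads only the column segments it actually references instead of pulling whole rows into the buffer cache:

-- Hypothetical fact table stored as a clustered columnstore index:
-- data is kept compressed, column by column.
CREATE TABLE dbo.FactSales
(
    SaleDate  DATE          NOT NULL,
    ProductID INT           NOT NULL,
    Quantity  INT           NOT NULL,
    Amount    DECIMAL(12,2) NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;
GO

-- The aggregate below touches only the SaleDate and Amount column segments,
-- instead of reading entire rows to get at two columns.
SELECT SaleDate, SUM(Amount) AS TotalAmount
FROM dbo.FactSales
GROUP BY SaleDate;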

Section 3: IBM Big Data & Analytics Adoption

Explain how to leverage maturity models for IBM Big Data & Analytics to identify the customer's current state and define the future progression.
With emphasis on the following:

A Solution Advisor can leverage a variety of maturity models to support their assessment of an organization's readiness for BD&A, such as the IBM BD&A Maturity Model and the Analytics Quotient (AQ) Maturity Model. The Solution Advisor can use these models to solicit information about the organization that helps identify how prepared it is and how it can most effectively adopt BD&A. The Solution Advisor can review the information with the organization to clearly identify its current ability to consume and manage information and then apply appropriate analytics technology for new insights or decisions. The organization's leaders can use the forward-looking progression points of maturity to consider new levels of advancement they may want to attain, and work with the Solution Advisor to outline the next steps to make these improvements. The various models address different qualities of an organization and can be applied in combination or separately, as appropriate.

The IBM Big Data & Analytics Maturity Model helps organizations assess their current capabilities in order to generate value from big data investments in support of strategic business initiatives. It does so by forming a considered assessment of the desired target state, identifying gaps, and providing guidance on the steps required to realize that end state. The BD&A Maturity Model depicts 5 stages of maturity across 6 different areas an organization needs to consider to support big data and analytics.

The 5 stages of progression are Ad Hoc, Foundational, Competitive, Differentiating, and Breakaway. These stages are designed to reflect both the organization's level of capability and its degree of competitiveness.

The 6 areas are business strategy, information, analytics, culture and execution, architecture, and governance. Together they address a holistic set of business and technology areas that affect an organization's ability to compete using BD&A.

Solution Advisors can examine each stage against each measurement area with the customer to identify how well the customer performs in each area, and therefore how competitive they are as a whole. They can then identify the aspects in which the customer wants to grow next, and to what level of maturity over a defined period of time, to chart a plan for that growth.

The AQ Maturity Model measures your organization's readiness, ability, and capacity to locate and apply insight, thereby re-orienting your business to make better decisions that deliver better outcomes. It measures your ability to act based on an understanding of history and context from the past, and your ability to make insightful forecasts and anticipate likely outcomes in order to optimize recommendations and judiciously automate decisions.

The AQ concept has two core components. The first is a numerical score that we calculate based on your answers to 15 multiple-choice questions. The second component is an AQ Maturity Model that maps these scores to one of four stages of increasing analytical maturity.

The four stages of Analytical Maturity are as follows: Novice, Builder, Leader, Master.

The Solution Advisor can devise a variety of techniques to apply the models and solicit information from the organization to understand its current state and define its future progression.

Talk approved for Microsoft Insights 2015

You have already seen on this blog that I submitted (together with my mega-geek friends Ivan Lima, Edvaldo Castro, and Fabiano Amorim) a series of talks for Microsoft Insights 2015 (formerly TechEd), and that most of them were rejected… But not all! :-)

And if you have heard me speak over the last year, or have been following this blog, you have probably noticed that I am going a bit beyond the relational world, studying several other things that are very interesting and important when thinking about the myriad of solutions we can build to tackle big data problems.

And at TechEd, Ivan Lima and I will not be talking about SQL Server on-premises, internals, troubleshooting, or performance…

Data science with Microsoft Azure (Big Data)

Data science is one of the major investment fronts for every big executive and, because of its scalability, the cloud has become the ideal environment for implementing solutions in an agile way. In this session we will cover the development of scalable solutions using the services available in Microsoft Azure, such as HDInsight, Machine Learning, and Data Factory.

Link: https://www.microsoftinsights.com.br/trail/TrailDetail/TrailId/2#lecture437

We hope that those attending Insights can join our session, and that we manage to break some myths, give clearer definitions and, of course, show demos with interesting applications.

Very happy to be speaking at this big Microsoft event in Brazil which, counting the old TechEds, will be my sixth time!

PS: I am not leaving SQL Server, or any other relational database, aside; I just think that today a data architect, to be considered good and mature, has many more problems in front of them, and these require different answers than the ones proposed a few years ago...

Regards,

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

Video 02 – Certifications II

Another video, continuing the discussion about certifications, but this time covering the SQL Server certification path, the late MCM program, and even the DB2 exams.


Regards,

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

Piecemeal restore with filestream

    Part of consulting work is not only defining the best architectural solutions, but also working within the client's constraints (e.g., money, time, need, and effectiveness) and, once the path forward is defined, thinking about delicate issues such as performance and disaster recovery, among others.
    The problem: at a client we are going to use SQL Server FILESTREAM to store images, and for a single event we expect to receive around 16 million images with an average size of 230 KB. Doing the math, that is approximately 3.5 TB of images (16,000,000 × 230 KB ≈ 3.7 billion KB, or roughly 3.4 TB), not counting the related data.
    So, considering an eventual disaster with the database, we do not want to wait X hours with the system offline until the restore finishes, so we proposed splitting the files across different regions, allowing partial recovery of the database. For teaching purposes I put together the script below, which shows how we can do a piecemeal restore with FILESTREAM in SQL Server, and I am posting it here to serve as a guide.
    Important note: I am not going to debate where the images should be stored, and I am not worried about aligning the partitioned indexes, partition switching, or moving data out of this database to avoid its obesity. Of course these are important questions that must be thought through, but I do that together with the client, and putting it all here would make this article far too long.

Database structure

    The database has the primary filegroup (FG) containing only the objects defined in model, plus one filegroup to store the rest of the database's data that is not images. So far there has been no need to partition other objects or split this data file. In addition, we have 5 FGs, one for each region to be defined, responsible for storing the images.

Script 01 – Database definition

USE master
GO

exec sp_configure 'filestream access level', 2
RECONFIGURE

-- 2) Create the database with FILESTREAM and the table that uses the feature
IF EXISTS (SELECT * FROM sys.databases WHERE name = N'NimbusFS')
  DROP DATABASE NimbusFS
GO

CREATE DATABASE NimbusFS ON PRIMARY
  ( NAME = NimbusFS_Primary, 
    FILENAME = N'D:\Temp\SQLData\NimbusFS.mdf',
    SIZE = 15MB,
    MAXSIZE = 50MB, 
    FILEGROWTH = 15MB),
FILEGROUP FGNimbusFSData DEFAULT
  ( NAME = NimbusFS_Data, 
    FILENAME = N'D:\Temp\SQLData\NimbusFS_Data.mdf',
    SIZE = 100MB,
    MAXSIZE = 10GB, 
    FILEGROWTH = 100MB) ,
FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM
  ( NAME = NimbusFS_FSG01, 
    FILENAME = N'D:\Temp\SQLData\FSG01'),
FILEGROUP FileStreamGroup2 CONTAINS FILESTREAM
  ( NAME = NimbusFS_FSG02, 
    FILENAME = N'D:\Temp\SQLData\FSG02'),
FILEGROUP FileStreamGroup3 CONTAINS FILESTREAM
  ( NAME = NimbusFS_FSG03, 
    FILENAME = N'D:\Temp\SQLData\FSG03'),
FILEGROUP FileStreamGroup4 CONTAINS FILESTREAM
  ( NAME = NimbusFS_FSG04, 
    FILENAME = N'D:\Temp\SQLData\FSG04'),
FILEGROUP FileStreamGroup5 CONTAINS FILESTREAM
  ( NAME = NimbusFS_FSG05, 
    FILENAME = N'D:\Temp\SQLData\FSG05')
LOG ON 
  ( NAME = 'NimbusFS_log', 
    FILENAME = N'D:\Temp\SQLLog\NimbusFS.ldf',
    SIZE = 50MB, 
    MAXSIZE = 1GB, 
    FILEGROWTH = 10MB);
GO

USE NimbusFS
GO

ALTER DATABASE NimbusFS
SET RECOVERY FULL
GO

    The Regiao table is defined with the 5 regions of Brazil in mind, but note that it is an abstract term and, if necessary, we could define our own "pseudo-regions", separating São Paulo or other states with a larger data volume, or whatever the business dictates.
    A partition function for the regions is used in 2 schemes. The first one does not separate the data into different filegroups, while the second spreads the files across the different FGs that contain FILESTREAM data. Then the Documento table is created, following the FILESTREAM requirements and partitioned by the IDRegiao attribute, using the previously defined schemes.
    You may be wondering why we did not partition by time, which is more common. In this problem, however, time is not significant, because a very large set of images would potentially fall into a single partition (the one active for the business), which would not help reduce the recovery time.

Script 02 – Tables and partitioning definition

IF (OBJECT_ID('dbo.Regiao') IS NOT NULL)
    DROP TABLE Regiao
GO

-- Default filegroup
CREATE TABLE dbo.Regiao
(ID INT IDENTITY NOT NULL PRIMARY KEY
 , Nome VARCHAR(100))
GO

INSERT INTO dbo.Regiao (Nome) VALUES ('Norte'), ('Sul'), ('Nordeste'), ('Sudeste'), ('Centro-Oeste')
GO

IF (OBJECT_ID('dbo.Documento') IS NOT NULL)
    DROP TABLE Documento
GO

CREATE PARTITION FUNCTION MyPartFunction (INT) AS RANGE LEFT FOR VALUES (1, 2, 3, 4);

CREATE PARTITION SCHEME MyPartScheme AS PARTITION MyPartFunction ALL TO (FGNimbusFSData);

CREATE PARTITION SCHEME MyPartSchemeFS AS PARTITION MyPartFunction TO (FileStreamGroup1, FileStreamGroup2, FileStreamGroup3, FileStreamGroup4, FileStreamGroup5);

CREATE TABLE dbo.Documento
(ID INT IDENTITY NOT NULL
 , ChaveDoc UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() ROWGUIDCOL UNIQUE ON FGNimbusFSData
 , IDFK INT NOT NULL
 , IDRegiao INT NOT NULL
 , Doc VARBINARY(MAX) FILESTREAM
) ON MyPartScheme(IDRegiao)
FILESTREAM_ON MyPartSchemeFS;
GO

ALTER TABLE Documento
ADD CONSTRAINT FK_Documento_Regiao_IDRegiao
FOREIGN KEY (IDRegiao)
REFERENCES dbo.Regiao (ID)
GO


Inserting data and taking backups

    Once the structure is created, the next steps are self-explanatory. I insert records into the database, across different regions, followed by a full backup, more inserts, filegroup backups and, at the end, a tail-of-the-log backup, which I put there to represent a scenario where we lose the data but can still take a "tail of the log".
    While you are running the steps of this script, I recommend taking a look at the FILESTREAM directories. As soon as a record for a region is inserted, the file shows up in that partition's filegroup, clearly showing the data "routing".
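If you also want to confirm, from inside SQL Server, where these FILESTREAM containers live on disk, an optional catalog query (not part of the original script, using the names from Script 01) is:

-- Optional check: list the FILESTREAM containers and their physical paths.
USE NimbusFS
GO
SELECT name, physical_name, type_desc
FROM sys.database_files
WHERE type_desc = 'FILESTREAM';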

Script 03 – Data and backups

INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (111, 1, CAST('Doc 111 - Regiao 1' AS varbinary(MAX)));
INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (222, 2, CAST('Doc 222 - Regiao 2' AS varbinary(MAX)));
INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (333, 3, CAST('Doc 333 - Regiao 3' AS varbinary(MAX)));
INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (444, 4, CAST('Doc 444 - Regiao 4' AS varbinary(MAX)));
INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (555, 5, CAST('Doc 555 - Regiao 5' AS varbinary(MAX)));

BACKUP DATABASE NimbusFS
TO DISK = 'D:\Temp\SQLBackup\NimbusFSFull.bak'
WITH INIT

INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (112, 1, CAST('Doc 112 - Regiao 1' AS varbinary(MAX)));
INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (223, 2, CAST('Doc 222 - Regiao 2' AS varbinary(MAX)));
INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (334, 3, CAST('Doc 333 - Regiao 3' AS varbinary(MAX)));
INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (445, 4, CAST('Doc 444 - Regiao 4' AS varbinary(MAX)));
INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (556, 5, CAST('Doc 555 - Regiao 5' AS varbinary(MAX)));
GO

SELECT * FROM Documento
GO

BACKUP DATABASE NimbusFS
FILEGROUP = 'PRIMARY'
TO DISK = 'D:\Temp\SQLBackup\NimbusFS_Primary.bak'
WITH INIT
GO

BACKUP DATABASE NimbusFS
FILEGROUP = 'FGNimbusFSData'
TO DISK = 'D:\Temp\SQLBackup\NimbusFS_FSData.bak'
WITH INIT
GO

INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (113, 1, CAST('Doc 113 - Regiao 1' AS varbinary(MAX)));

BACKUP DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup1'
TO DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG1.bak'
WITH INIT
GO

INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (224, 2, CAST('Doc 222 - Regiao 2' AS varbinary(MAX)));

BACKUP DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup2'
TO DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG2.bak'
WITH INIT
GO

INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (335, 3, CAST('Doc 333 - Regiao 3' AS varbinary(MAX)));

BACKUP DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup3'
TO DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG3.bak'
WITH INIT
GO

INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (446, 4, CAST('Doc 444 - Regiao 4' AS varbinary(MAX)));

BACKUP DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup4'
TO DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG4.bak'
WITH INIT
GO

INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (557, 5, CAST('Doc 555 - Regiao 5' AS varbinary(MAX)));

BACKUP DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup5'
TO DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG5.bak'
WITH INIT
GO

INSERT INTO Documento (IDFK, IDRegiao, Doc) VALUES (558, 5, CAST('Doc 555 - Regiao 5' AS varbinary(MAX)));

SELECT * FROM Documento
GO

-- "Tail of the log"
BACKUP LOG NimbusFS
TO DISK = 'D:\Temp\SQLBackup\NimbusFS_LOG01.trn'
WITH INIT;


Recovering from a disaster

    Now suppose that at this moment you lost your database and have to start the restore. You would not want to wait for the whole restore to finish before telling the business everyone can get back to work; instead you would let people resume working gradually, minimizing the impact for the most critical regions. Right?
    So, if you did your homework properly, you can do a piecemeal restore...

Script 04 – Starting the piecemeal restore

USE master
GO

-- Simulate the disaster... or a DBA having a "brilliant" moment... :-)
DROP DATABASE NimbusFS;
GO

-- Start of the piecemeal restore
RESTORE DATABASE NimbusFS
FILEGROUP = 'PRIMARY'
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_Primary.bak'
WITH PARTIAL, NORECOVERY

RESTORE DATABASE NimbusFS
FILEGROUP = 'FGNimbusFSData'
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_FSData.bak'
WITH NORECOVERY

RESTORE LOG NimbusFS
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_LOG01.trn'

    We have now restored the primary filegroup and the FG with the data (without the images), bringing them to a consistent state. From this moment on you already have access to the database and some functionality can be used.
    Queries against system tables, tables that do not use FILESTREAM, and even the non-FILESTREAM columns of the table can be manipulated. However, when you try to access data that lives in a filegroup not yet restored, you will see message 670, as shown below.

Script 05 – Partially using the database

SELECT * FROM NimbusFS.sys.objects WHERE TYPE = 'U';
SELECT * FROM NimbusFS.dbo.Regiao;
SELECT * FROM NimbusFS.dbo.Documento;

/*
Msg 670, Level 16, State 1, Line 177
Large object (LOB) data for table "dbo.Documento" resides on an offline filegroup ("FileStreamGroup1") that cannot be accessed.
*/

SELECT id, ChaveDoc, IDRegiao FROM NimbusFS.dbo.Documento;
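As an optional side check (not part of the original script), the catalog views show which files are already online and which still belong to filegroups pending restore:

-- Optional check: file state during the piecemeal restore
-- (ONLINE for restored filegroups; pending ones show another state, such as RECOVERY_PENDING).
SELECT df.name AS file_name, fg.name AS filegroup_name, df.state_desc
FROM NimbusFS.sys.database_files AS df
LEFT JOIN NimbusFS.sys.filegroups AS fg
    ON df.data_space_id = fg.data_space_id;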


    From this point on, your choice is to decide which region should come back first (in general this is defined in the DR plan you wrote, but anyway...). The example below brings region 2 online and runs a query filtering by that region, which works perfectly, but not for the other regions.

Script 06 – Restoring region 02

-- Region 02
RESTORE DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup2'
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG2.bak'
WITH NORECOVERY

RESTORE LOG NimbusFS
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_LOG01.trn'
GO

-- Error
SELECT * FROM NimbusFS.dbo.Documento;
-- And...
SELECT * FROM NimbusFS.dbo.Documento WHERE IDRegiao = 2;


    Continue the recovery procedure until all filegroups are online. Also note that the last insert, which existed only in the transaction log backup, is successfully recovered as well.

Script 07 – Restoring the other regions

-- Region 01
RESTORE DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup1'
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG1.bak'
WITH NORECOVERY

RESTORE LOG NimbusFS
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_LOG01.trn'
GO

-- Error
SELECT * FROM NimbusFS.dbo.Documento;
-- And...
SELECT * FROM NimbusFS.dbo.Documento WHERE IDRegiao IN (1, 2);

-- Region 04
RESTORE DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup4'
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG4.bak'
WITH NORECOVERY

RESTORE LOG NimbusFS
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_LOG01.trn'
GO

-- Error
SELECT * FROM NimbusFS.dbo.Documento;
-- And...
SELECT * FROM NimbusFS.dbo.Documento WHERE IDRegiao IN (1, 2, 4);

-- Region 03
RESTORE DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup3'
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG3.bak'
WITH NORECOVERY

RESTORE LOG NimbusFS
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_LOG01.trn'
GO

-- Error
SELECT * FROM NimbusFS.dbo.Documento;
-- And...
SELECT * FROM NimbusFS.dbo.Documento WHERE IDRegiao IN (1, 2, 3, 4);

-- Region 05
RESTORE DATABASE NimbusFS
FILEGROUP = 'FileStreamGroup5'
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_FSG5.bak'
WITH NORECOVERY

RESTORE LOG NimbusFS
FROM DISK = 'D:\Temp\SQLBackup\NimbusFS_LOG01.trn'
GO

SELECT * FROM NimbusFS.dbo.Documento;
-- And...
SELECT * FROM NimbusFS.dbo.Documento WHERE IDRegiao IN (1, 2, 3, 4, 5);


    At this point your database is fully online! It is worth noting that the recovery process as a whole takes longer this way, but you can gradually let regions of your company get back to work, without having to wait for the entire DR process.

    Did you like it? I hope so! And it was good to get back to purely technical posts... which will now come more frequently, I promise.

Want the script? You can download it from OneDrive.

Regards,

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

[SQLServerDF] Material XXV – Parallelism in SQL Server

A few days ago I presented on SQL Server parallelism at SQLServerDF, so here is the material (PDF and scripts).
As I told the group, the session will soon become a Nimbus on-demand training with a bit more content, so I reused the PPT, since I did not have time to change it.

You can download the material from OneDrive.

Regards,

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

DBA guide site and 50% off a Nimbus training course

Yesterday I received an e-mail about a recent initiative focused on databases and DBAs, http://guiadba.com.br/.
I did not know the site, but as I told them, every initiative to spread knowledge is welcome, and I hope they manage to keep a regular pace and post quality content.

As a show of support, Sr. Nimbus provided a 50% discount on any 2015 training course, which they will raffle off next week, so it is a good opportunity for you to take part. I hope whoever wins the discount can join the next class and understand why everyone speaks so highly of the Nimbus trainings (http://www.srnimbus.com.br/turmas/)! Yes, modesty went right out the window here…

Regards,

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

[SQLServerDF] Meeting XXV – Parallelism in SQL Server

To mark meeting number 25 of the SQLServerDF group, I will once again give a talk about SQL Server… finally! I will use my talk from the recent SQLSats as a base and include more material.

By the way, at the beginning of this year I got to know a new training center in Brasília that I liked a lot (I will even use it for the next Nimbus training in Brasília), so I decided to try something new with SQLServerDF and hold a meeting at XTC. We are excited about the idea, and I hope you like the venue.

Also breaking with the usual pattern, the meeting will be on a Monday, because on Wednesday I will have surgery that will keep me on medical leave for a few days.
Other than that, you already know what to do: please confirm your attendance with your name and e-mail on Google Groups. For those not yet in the group, just go to http://groups.google.com/group/sqlserverdf, sign up, and wait for my moderation.

Date and time: 09/03/2015, between 18:00 and 20:00
Location: Xperts Trainning Center
Speaker: Luciano [Luti] Moreira
Title: Parallelism in SQL Server
Description: A recurring topic in SQL Server discussions is the use of parallelism, and the question always comes up: is it good or bad? In this session we will look at the details of parallelism in SQL Server, the settings that influence its use (MAXDOP = 1?), understand the famous CXPACKET wait type, and analyze several parallel queries. By the end of the session you will be able to look at parallelism with a critical eye and know how to use it effectively.
Speaker mini-bio:
Luciano Moreira is a founding partner of Sr. Nimbus. A SQL Server specialist and MVP, in recent years he has been exploring the details of other databases, relational or not, as well as studying topics related to solution architecture and data science. He splits his professional time between consulting, training, and the technical community, helping companies design solutions, use products efficiently and, of course, solve database problems.

Xperts Trainning Center
SHIS QI 15 Conjunto 8/9 Área Especial Bloco D, Subsolo - Lago Sul (at the end of the street)
CEP 71635-565 - Brasília - DF
Phone: (61) 4063-8177 | 9545-9241
Landmark: near Hospital Brasília

Regards,

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br